Artificial Intelligence 139
☆ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework
enhancing multi-modality Vision Language Models (VLMs). Despite the
advancements in VLMs facilitating basic visual dialog and reasoning, a
performance gap persists compared to advanced models like GPT-4 and Gemini. We
try to narrow the gap by mining the potential of VLMs for better performance
and any-to-any workflow from three aspects, i.e., high-resolution visual
tokens, high-quality data, and VLM-guided generation. To enhance visual tokens,
we propose to utilize an additional visual encoder for high-resolution
refinement without increasing the visual token count. We further construct a
high-quality dataset that promotes precise image comprehension and
reasoning-based generation, expanding the operational scope of current VLMs. In
general, Mini-Gemini further mines the potential of VLMs and empowers current
frameworks with image understanding, reasoning, and generation simultaneously.
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs)
from 2B to 34B. It is demonstrated to achieve leading performance in several
zero-shot benchmarks and even surpasses the developed private models. Code and
models are available at https://github.com/dvlab-research/MiniGemini.
comment: Code and models are available at
https://github.com/dvlab-research/MiniGemini
☆ ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation CVPR
In the absence of parallax cues, a learning-based single image depth
estimation (SIDE) model relies heavily on shading and contextual cues in the
image. While this simplicity is attractive, it is necessary to train such
models on large and varied datasets, which are difficult to capture. It has
been shown that using embeddings from pre-trained foundational models, such as
CLIP, improves zero shot transfer in several applications. Taking inspiration
from this, in our paper we explore the use of global image priors generated
from a pre-trained ViT model to provide more detailed contextual information.
We argue that the embedding vector from a ViT model, pre-trained on a large
dataset, captures greater relevant information for SIDE than the usual route of
generating pseudo image captions, followed by CLIP based text embeddings. Based
on this idea, we propose a new SIDE model using a diffusion backbone which is
conditioned on ViT embeddings. Our proposed design establishes a new
state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of
0.059(14% improvement) compared to 0.069 by the current SOTA (VPD). And on
KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to
0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model
trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%)
over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%,
18%, 45%, 9%) by ZoeDepth. The code is available at
https://github.com/Aradhye2002/EcoDepth.
comment: Accepted at IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) 2024
☆ Long-form factuality in large language models
Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
Large language models (LLMs) often generate content that contains factual
errors when responding to fact-seeking prompts on open-ended topics. To
benchmark a model's long-form factuality in open domains, we first use GPT-4 to
generate LongFact, a prompt set comprising thousands of questions spanning 38
topics. We then propose that LLM agents can be used as automated evaluators for
long-form factuality through a method which we call Search-Augmented Factuality
Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into
a set of individual facts and to evaluate the accuracy of each fact using a
multi-step reasoning process comprising sending search queries to Google Search
and determining whether a fact is supported by the search results. Furthermore,
we propose extending F1 score as an aggregated metric for long-form factuality.
To do so, we balance the percentage of supported facts in a response
(precision) with the percentage of provided facts relative to a hyperparameter
representing a user's preferred response length (recall).
Empirically, we demonstrate that LLM agents can achieve superhuman rating
performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced
human annotators 72% of the time, and on a random subset of 100 disagreement
cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times
cheaper than human annotators. We also benchmark thirteen language models on
LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding
that larger language models generally achieve better long-form factuality.
LongFact, SAFE, and all experimental code are available at
https://github.com/google-deepmind/long-form-factuality.
☆ Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction
We tackle the challenge of efficiently reconstructing a 3D asset from a
single image with growing demands for automated 3D content creation pipelines.
Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural
Radiance Fields (NeRF). Despite their significant success, these approaches
encounter practical limitations due to lengthy optimization and considerable
memory usage. In this report, we introduce Gamba, an end-to-end amortized 3D
reconstruction model from single-view images, emphasizing two main insights:
(1) 3D representation: leveraging a large number of 3D Gaussians for an
efficient 3D Gaussian splatting process; (2) Backbone design: introducing a
Mamba-based sequential network that facilitates context-dependent reasoning and
linear scalability with the sequence (token) length, accommodating a
substantial number of Gaussians. Gamba incorporates significant advancements in
data preprocessing, regularization design, and training methodologies. We
assessed Gamba against existing optimization-based and feed-forward 3D
generation approaches using the real-world scanned OmniObject3D dataset. Here,
Gamba demonstrates competitive generation capabilities, both qualitatively and
quantitatively, while achieving remarkable speed, approximately 0.6 second on a
single NVIDIA A100 GPU.
☆ ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object CVPR 2024
We establish rigorous benchmarks for visual perception robustness. Synthetic
images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific
type of evaluation over synthetic corruptions, backgrounds, and textures, yet
those robustness benchmarks are restricted in specified variations and have low
synthetic quality. In this work, we introduce generative model as a data source
for synthesizing hard images that benchmark deep models' robustness. Leveraging
diffusion models, we are able to generate images with more diversified
backgrounds, textures, and materials than any prior work, where we term this
benchmark as ImageNet-D. Experimental results show that ImageNet-D results in a
significant accuracy drop to a range of vision models, from the standard ResNet
visual classifier to the latest foundation models like CLIP and MiniGPT-4,
significantly reducing their accuracy by up to 60\%. Our work suggests that
diffusion models can be an effective source to test vision models. The code and
dataset are available at https://github.com/chenshuang-zhang/imagenet_d.
comment: Accepted at CVPR 2024
☆ Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means
This paper introduces a novel K-means clustering algorithm, an advancement on
the conventional Big-means methodology. The proposed method efficiently
integrates parallel processing, stochastic sampling, and competitive
optimization to create a scalable variant designed for big data applications.
It addresses scalability and computation time challenges typically faced with
traditional techniques. The algorithm adjusts sample sizes dynamically for each
worker during execution, optimizing performance. Data from these sample sizes
are continually analyzed, facilitating the identification of the most efficient
configuration. By incorporating a competitive element among workers using
different sample sizes, efficiency within the Big-means algorithm is further
stimulated. In essence, the algorithm balances computational time and
clustering quality by employing a stochastic, competitive sampling strategy in
a parallel computing setting.
☆ ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition
Weidong Xie, Lun Luo, Nanfei Ye, Yi Ren, Shaoyi Du, Minhang Wang, Jintao Xu, Rui Ai, Weihao Gu, Xieyuanli Chen
Place recognition is an important task for robots and autonomous cars to
localize themselves and close loops in pre-built maps. While single-modal
sensor-based methods have shown satisfactory performance, cross-modal place
recognition that retrieving images from a point-cloud database remains a
challenging problem. Current cross-modal methods transform images into 3D
points using depth estimation for modality conversion, which are usually
computationally intensive and need expensive labeled data for depth
supervision. In this work, we introduce a fast and lightweight framework to
encode images and point clouds into place-distinctive descriptors. We propose
an effective Field of View (FoV) transformation module to convert point clouds
into an analogous modality as images. This module eliminates the necessity for
depth estimation and helps subsequent modules achieve real-time performance. We
further design a non-negative factorization-based encoder to extract mutually
consistent semantic features between point clouds and images. This encoder
yields more distinctive global descriptors for retrieval. Experimental results
on the KITTI dataset show that our proposed methods achieve state-of-the-art
performance while running in real time. Additional evaluation on the HAOMO
dataset covering a 17 km trajectory further shows the practical generalization
capabilities. We have released the implementation of our methods as open source
at: https://github.com/haomo-ai/ModaLink.git.
comment: 8 pages, 11 figures, conference
☆ Detection of subclinical atherosclerosis by image-based deep learning on chest x-ray
Guglielmo Gallone, Francesco Iodice, Alberto Presta, Davide Tore, Ovidio de Filippo, Michele Visciano, Carlo Alberto Barbano, Alessandro Serafini, Paola Gorrini, Alessandro Bruno, Walter Grosso Marra, James Hughes, Mario Iannaccone, Paolo Fonio, Attilio Fiandrotti, Alessandro Depaoli, Marco Grangetto, Gaetano Maria de Ferrari, Fabrizio D'Ascenzo
Aims. To develop a deep-learning based system for recognition of subclinical
atherosclerosis on a plain frontal chest x-ray. Methods and Results. A
deep-learning algorithm to predict coronary artery calcium (CAC) score (the
AI-CAC model) was developed on 460 chest x-ray (80% training cohort, 20%
internal validation cohort) of primary prevention patients (58.4% male, median
age 63 [51-74] years) with available paired chest x-ray and chest computed
tomography (CT) indicated for any clinical reason and performed within 3
months. The CAC score calculated on chest CT was used as ground truth. The
model was validated on an temporally-independent cohort of 90 patients from the
same institution (external validation). The diagnostic accuracy of the AI-CAC
model assessed by the area under the curve (AUC) was the primary outcome.
Overall, median AI-CAC score was 35 (0-388) and 28.9% patients had no AI-CAC.
AUC of the AI-CAC model to identify a CAC>0 was 0.90 in the internal validation
cohort and 0.77 in the external validation cohort. Sensitivity was consistently
above 92% in both cohorts. In the overall cohort (n=540), among patients with
AI-CAC=0, a single ASCVD event occurred, after 4.3 years. Patients with
AI-CAC>0 had significantly higher Kaplan Meier estimates for ASCVD events
(13.5% vs. 3.4%, log-rank=0.013). Conclusion. The AI-CAC model seems to
accurately detect subclinical atherosclerosis on chest x-ray with elevated
sensitivity, and to predict ASCVD events with elevated negative predictive
value. Adoption of the AI-CAC model to refine CV risk stratification or as an
opportunistic screening tool requires prospective evaluation.
comment: Submitted to European Heart Journal - Cardiovascular Imaging Added
also the additional material 44 pages (30 main paper, 14 additional
material), 14 figures (5 main manuscript, 9 additional material)
☆ Many-Objective Evolutionary Influence Maximization: Balancing Spread, Budget, Fairness, and Time GECCO 24
The Influence Maximization (IM) problem seeks to discover the set of nodes in
a graph that can spread the information propagation at most. This problem is
known to be NP-hard, and it is usually studied by maximizing the influence
(spread) and, optionally, optimizing a second objective, such as minimizing the
seed set size or maximizing the influence fairness. However, in many practical
scenarios multiple aspects of the IM problem must be optimized at the same
time. In this work, we propose a first case study where several IM-specific
objective functions, namely budget, fairness, communities, and time, are
optimized on top of the maximization of influence and minimization of the seed
set size. To this aim, we introduce MOEIM (Many-Objective Evolutionary
Algorithm for Influence Maximization) a Multi-Objective Evolutionary Algorithm
(MOEA) based on NSGA-II incorporating graph-aware operators and a smart
initialization. We compare MOEIM in two experimental settings, including a
total of nine graph datasets, two heuristic methods, a related MOEA, and a
state-of-the-art Deep Learning approach. The experiments show that MOEIM
overall outperforms the competitors in most of the tested many-objective
settings. To conclude, we also investigate the correlation between the
objectives, leading to novel insights into the topic. The codebase is available
at https://github.com/eliacunegatti/MOEIM.
comment: To appear in Genetic and Evolutionary Computation Conference (GECCO
24 Companion), July 14 18, 2024, Melbourne, VIC, Australia. ACM, New York,
NY, USA
☆ Understanding the Learning Dynamics of Alignment with Human Feedback
Aligning large language models (LLMs) with human intentions has become a
critical task for safely deploying models in real-world systems. While existing
alignment approaches have seen empirical success, theoretically understanding
how these methods affect model behavior remains an open question. Our work
provides an initial attempt to theoretically analyze the learning dynamics of
human preference alignment. We formally show how the distribution of preference
datasets influences the rate of model updates and provide rigorous guarantees
on the training accuracy. Our theory also reveals an intricate phenomenon where
the optimization is prone to prioritizing certain behaviors with higher
preference distinguishability. We empirically validate our findings on
contemporary LLMs and alignment tasks, reinforcing our theoretical insights and
shedding light on considerations for future alignment approaches. Disclaimer:
This paper contains potentially offensive text; reader discretion is advised.
☆ Enhancing Manufacturing Quality Prediction Models through the Integration of Explainability Methods
This research presents a method that utilizes explainability techniques to
amplify the performance of machine learning (ML) models in forecasting the
quality of milling processes, as demonstrated in this paper through a
manufacturing use case. The methodology entails the initial training of ML
models, followed by a fine-tuning phase where irrelevant features identified
through explainability methods are eliminated. This procedural refinement
results in performance enhancements, paving the way for potential reductions in
manufacturing costs and a better understanding of the trained ML models. This
study highlights the usefulness of explainability techniques in both explaining
and optimizing predictive models in the manufacturing realm.
☆ Probabilistic Model Checking of Stochastic Reinforcement Learning Policies
We introduce a method to verify stochastic reinforcement learning (RL)
policies. This approach is compatible with any RL algorithm as long as the
algorithm and its corresponding environment collectively adhere to the Markov
property. In this setting, the future state of the environment should depend
solely on its current state and the action executed, independent of any
previous states or actions. Our method integrates a verification technique,
referred to as model checking, with RL, leveraging a Markov decision process, a
trained RL policy, and a probabilistic computation tree logic (PCTL) formula to
build a formal model that can be subsequently verified via the model checker
Storm. We demonstrate our method's applicability across multiple benchmarks,
comparing it to baseline methods called deterministic safety estimates and
naive monolithic model checking. Our results show that our method is suited to
verify stochastic RL policies.
☆ Semi-Supervised Learning for Deep Causal Generative Models
Developing models that can answer questions of the form "How would $x$ change
if $y$ had been $z$?" is fundamental for advancing medical image analysis.
Training causal generative models that address such counterfactual questions,
though, currently requires that all relevant variables have been observed and
that corresponding labels are available in training data. However, clinical
data may not have complete records for all patients and state of the art causal
generative models are unable to take full advantage of this. We thus develop,
for the first time, a semi-supervised deep causal generative model that
exploits the causal relationships between variables to maximise the use of all
available data. We explore this in the setting where each sample is either
fully labelled or fully unlabelled, as well as the more clinically realistic
case of having different labels missing for each sample. We leverage techniques
from causal inference to infer missing values and subsequently generate
realistic counterfactuals, even for samples with incomplete labels.
☆ Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
Large Vision-Language Models (LVLMs) are increasingly adept at generating
contextually detailed and coherent responses from visual inputs. However, their
application in multimodal decision-making and open-ended generation is hindered
by a notable rate of hallucinations, where generated text inaccurately
represents the visual contents. To address this issue, this paper introduces
the Instruction Contrastive Decoding (ICD) method, a novel approach designed to
reduce hallucinations during LVLM inference. Our method is inspired by our
observation that what we call disturbance instructions significantly exacerbate
hallucinations in multimodal fusion modules. ICD contrasts distributions from
standard and instruction disturbance, thereby increasing alignment uncertainty
and effectively subtracting hallucinated concepts from the original
distribution. Through comprehensive experiments on discriminative benchmarks
(POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that
ICD significantly mitigates both object-level and attribute-level
hallucinations. Moreover, our method not only addresses hallucinations but also
significantly enhances the general perception and recognition capabilities of
LVLMs.
☆ SAT-NGP : Unleashing Neural Graphics Primitives for Fast Relightable Transient-Free 3D reconstruction from Satellite Imagery
Current stereo-vision pipelines produce high accuracy 3D reconstruction when
using multiple pairs or triplets of satellite images. However, these pipelines
are sensitive to the changes between images that can occur as a result of
multi-date acquisitions. Such variations are mainly due to variable shadows,
reflexions and transient objects (cars, vegetation). To take such changes into
account, Neural Radiance Fields (NeRF) have recently been applied to multi-date
satellite imagery. However, Neural methods are very compute-intensive, taking
dozens of hours to learn, compared with minutes for standard stereo-vision
pipelines. Following the ideas of Instant Neural Graphics Primitives we propose
to use an efficient sampling strategy and multi-resolution hash encoding to
accelerate the learning. Our model, Satellite Neural Graphics Primitives
(SAT-NGP) decreases the learning time to 15 minutes while maintaining the
quality of the 3D reconstruction.
comment: 5 pages, 3 figures, 1 table; Accepted to International Geoscience and
Remote Sensing Symposium (IGARSS) 2024; Code available at
https://github.com/Ellimac0/SAT-NGP
☆ Contrastive Learning with Orthonormal Anchors (CLOA)
This study focuses on addressing the instability issues prevalent in
contrastive learning, specifically examining the InfoNCE loss function and its
derivatives. We reveal a critical observation that these loss functions exhibit
a restrictive behavior, leading to a convergence phenomenon where embeddings
tend to merge into a singular point. This "over-fusion" effect detrimentally
affects classification accuracy in subsequent supervised-learning tasks.
Through theoretical analysis, we demonstrate that embeddings, when equalized or
confined to a rank-1 linear subspace, represent a local minimum for InfoNCE. In
response to this challenge, our research introduces an innovative strategy that
leverages the same or fewer labeled data than typically used in the fine-tuning
phase. The loss we proposed, Orthonormal Anchor Regression Loss, is designed to
disentangle embedding clusters, significantly enhancing the distinctiveness of
each embedding while simultaneously ensuring their aggregation into dense,
well-defined clusters. Our method demonstrates remarkable improvements with
just a fraction of the conventional label requirements, as evidenced by our
results on CIFAR10 and CIFAR100 datasets.
comment: 11 pages, 4 figures
☆ Annolid: Annotate, Segment, and Track Anything You Need
Annolid is a deep learning-based software package designed for the
segmentation, labeling, and tracking of research targets within video files,
focusing primarily on animal behavior analysis. Based on state-of-the-art
instance segmentation methods, Annolid now harnesses the Cutie video object
segmentation model to achieve resilient, markerless tracking of multiple
animals from single annotated frames, even in environments in which they may be
partially or entirely concealed by environmental features or by one another.
Our integration of Segment Anything and Grounding-DINO strategies additionally
enables the automatic masking and segmentation of recognizable animals and
objects by text command, removing the need for manual annotation. Annolid's
comprehensive approach to object segmentation flexibly accommodates a broad
spectrum of behavior analysis applications, enabling the classification of
diverse behavioral states such as freezing, digging, pup huddling, and social
interactions in addition to the tracking of animals and their body parts.
☆ TransFusion: Contrastive Learning with Transformers
This paper proposes a novel framework, TransFusion, designed to make the
process of contrastive learning more analytical and explainable. TransFusion
consists of attention blocks whose softmax being replaced by ReLU, and its
final block's weighted-sum operation is truncated to leave the adjacency matrix
as the output. The model is trained by minimizing the Jensen-Shannon Divergence
between its output and the target affinity matrix, which indicates whether each
pair of samples belongs to the same or different classes. The main contribution
of TransFusion lies in defining a theoretical limit for answering two
fundamental questions in the field: the maximum level of data augmentation and
the minimum batch size required for effective contrastive learning.
Furthermore, experimental results indicate that TransFusion successfully
extracts features that isolate clusters from complex real-world data, leading
to improved classification accuracy in downstream tasks.
comment: 17 pages, 4 figures,
☆ Aiming for Relevance
Vital signs are crucial in intensive care units (ICUs). They are used to
track the patient's state and to identify clinically significant changes.
Predicting vital sign trajectories is valuable for early detection of adverse
events. However, conventional machine learning metrics like RMSE often fail to
capture the true clinical relevance of such predictions. We introduce novel
vital sign prediction performance metrics that align with clinical contexts,
focusing on deviations from clinical norms, overall trends, and trend
deviations. These metrics are derived from empirical utility curves obtained in
a previous study through interviews with ICU clinicians. We validate the
metrics' usefulness using simulated and real clinical datasets (MIMIC and
eICU). Furthermore, we employ these metrics as loss functions for neural
networks, resulting in models that excel in predicting clinically significant
events. This research paves the way for clinically relevant machine learning
model evaluation and optimization, promising to improve ICU patient care. 10
pages, 9 figures.
comment: 10 pages, 9 figures, AMIA Informatics 2024
☆ INEXA: Interactive and Explainable Process Model Abstraction Through Object-Centric Process Mining
Process events are recorded by multiple information systems at different
granularity levels. Based on the resulting event logs, process models are
discovered at different granularity levels, as well. Events stored at a
fine-grained granularity level, for example, may hinder the discovered process
model to be displayed due the high number of resulting model elements. The
discovered process model of a real-world manufacturing process, for example,
consists of 1,489 model elements and over 2,000 arcs. Existing process model
abstraction techniques could help reducing the size of the model, but would
disconnect it from the underlying event log. Existing event abstraction
techniques do neither support the analysis of mixed granularity levels, nor
interactive exploration of a suitable granularity level. To enable the
exploration of discovered process models at different granularity levels, we
propose INEXA, an interactive, explainable process model abstraction method
that keeps the link to the event log. As a starting point, INEXA aggregates
large process models to a "displayable" size, e.g., for the manufacturing use
case to a process model with 58 model elements. Then, the process analyst can
explore granularity levels interactively, while applied abstractions are
automatically traced in the event log for explainability.
☆ Spikewhisper: Temporal Spike Backdoor Attacks on Federated Neuromorphic Learning over Low-power Devices
Federated neuromorphic learning (FedNL) leverages event-driven spiking neural
networks and federated learning frameworks to effectively execute intelligent
analysis tasks over amounts of distributed low-power devices but also perform
vulnerability to poisoning attacks. The threat of backdoor attacks on
traditional deep neural networks typically comes from time-invariant data.
However, in FedNL, unknown threats may be hidden in time-varying spike signals.
In this paper, we start to explore a novel vulnerability of FedNL-based systems
with the concept of time division multiplexing, termed Spikewhisper, which
allows attackers to evade detection as much as possible, as multiple malicious
clients can imperceptibly poison with different triggers at different
timeslices. In particular, the stealthiness of Spikewhisper is derived from the
time-domain divisibility of global triggers, in which each malicious client
pastes only one local trigger to a certain timeslice in the neuromorphic
sample, and also the polarity and motion of each local trigger can be
configured by attackers. Extensive experiments based on two different
neuromorphic datasets demonstrate that the attack success rate of Spikewispher
is higher than the temporally centralized attacks. Besides, it is validated
that the effect of Spikewispher is sensitive to the trigger duration.
☆ RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Procedure Planning in instructional videos entails generating a sequence of
action steps based on visual observations of the initial and target states.
Despite the rapid progress in this task, there remain several critical
challenges to be solved: (1) Adaptive procedures: Prior works hold an
unrealistic assumption that the number of action steps is known and fixed,
leading to non-generalizable models in real-world scenarios where the sequence
length varies. (2) Temporal relation: Understanding the step temporal relation
knowledge is essential in producing reasonable and executable plans. (3)
Annotation cost: Annotating instructional videos with step-level labels (i.e.,
timestamp) or sequence-level labels (i.e., action category) is demanding and
labor-intensive, limiting its generalizability to large-scale datasets.In this
work, we propose a new and practical setting, called adaptive procedure
planning in instructional videos, where the procedure length is not fixed or
pre-determined. To address these challenges we introduce Retrieval-Augmented
Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively
determines the conclusion of actions using an auto-regressive model
architecture. For temporal relation, RAP establishes an external memory module
to explicitly retrieve the most relevant state-action pairs from the training
videos and revises the generated procedures. To tackle high annotation cost,
RAP utilizes a weakly-supervised learning manner to expand the training dataset
to other task-relevant, unannotated videos by generating pseudo labels for
action steps. Experiments on CrossTask and COIN benchmarks show the superiority
of RAP over traditional fixed-length models, establishing it as a strong
baseline solution for adaptive procedure planning.
comment: 23 pages, 6 figures, 12 tables
☆ Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
The tokenizer, as one of the fundamental components of large models, has long
been overlooked or even misunderstood in visual tasks. One key factor of the
great comprehension power of the large language model is that natural language
tokenizers utilize meaningful words or subwords as the basic elements of
language. In contrast, mainstream visual tokenizers, represented by patch-based
methods such as Patch Embed, rely on meaningless rectangular patches as basic
elements of vision, which cannot serve as effectively as words or subwords in
language. Starting from the essence of the tokenizer, we defined semantically
independent regions (SIRs) for vision. We designed a simple HOmogeneous visual
tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception
Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity,
the OPM splits the image into 4*4 pixel seeds and then utilizes the attention
mechanism to perceive SIRs. The OVM employs cross-attention to merge seeds
within the same SIR. To achieve adaptability, the OVM defines a variable number
of learnable vectors as cross-attention queries, allowing for the adjustment of
token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19
classification dataset, and GID5 segmentation dataset for sparse and dense
tasks. The results demonstrate that the visual tokens obtained by HOOK
correspond to individual objects, which demonstrates homogeneity. HOOK
outperformed Patch Embed by 6\% and 10\% in the two tasks and achieved
state-of-the-art performance compared to the baselines used for comparison.
Compared to Patch Embed, which requires more than one hundred tokens for one
image, HOOK requires only 6 and 8 tokens for sparse and dense tasks,
respectively, resulting in efficiency improvements of 1.5 to 2.8 times. The
code is available at https://github.com/GeoX-Lab/Hook.
comment: 20 pages, 8 figures, 6 tables
☆ Physics-Informed Graph Neural Networks for Water Distribution Systems AAAI
Water distribution systems (WDS) are an integral part of critical
infrastructure which is pivotal to urban development. As 70% of the world's
population will likely live in urban environments in 2050, efficient simulation
and planning tools for WDS play a crucial role in reaching UN's sustainable
developmental goal (SDG) 6 - "Clean water and sanitation for all". In this
realm, we propose a novel and efficient machine learning emulator, more
precisely, a physics-informed deep learning (DL) model, for hydraulic state
estimation in WDS. Using a recursive approach, our model only needs a few graph
convolutional neural network (GCN) layers and employs an innovative algorithm
based on message passing. Unlike conventional machine learning tasks, the model
uses hydraulic principles to infer two additional hydraulic state features in
the process of reconstructing the available ground truth feature in an
unsupervised manner. To the best of our knowledge, this is the first DL
approach to emulate the popular hydraulic simulator EPANET, utilizing no
additional information. Like most DL models and unlike the hydraulic simulator,
our model demonstrates vastly faster emulation times that do not increase
drastically with the size of the WDS. Moreover, we achieve high accuracy on the
ground truth and very similar results compared to the hydraulic simulator as
demonstrated through experiments on five real-world WDS datasets.
comment: Extended version of the paper with the same title published at
Proceedings of the AAAI Conference on Artificial Intelligence 2024
☆ PDNNet: PDN-Aware GNN-CNN Heterogeneous Network for Dynamic IR Drop Prediction
IR drop on the power delivery network (PDN) is closely related to PDN's
configuration and cell current consumption. As the integrated circuit (IC)
design is growing larger, dynamic IR drop simulation becomes computationally
unaffordable and machine learning based IR drop prediction has been explored as
a promising solution. Although CNN-based methods have been adapted to IR drop
prediction task in several works, the shortcomings of overlooking PDN
configuration is non-negligible. In this paper, we consider not only how to
properly represent cell-PDN relation, but also how to model IR drop following
its physical nature in the feature aggregation procedure. Thus, we propose a
novel graph structure, PDNGraph, to unify the representations of the PDN
structure and the fine-grained cell-PDN relation. We further propose a
dual-branch heterogeneous network, PDNNet, incorporating two parallel GNN-CNN
branches to favorably capture the above features during the learning process.
Several key designs are presented to make the dynamic IR drop prediction highly
effective and interpretable. We are the first work to apply graph structure to
deep-learning based dynamic IR drop prediction method. Experiments show that
PDNNet outperforms the state-of-the-art CNN-based methods by up to 39.3%
reduction in prediction error and achieves 545x speedup compared to the
commercial tool, which demonstrates the superiority of our method.
☆ Neural Architecture Search for Sentence Classification with BERT
Pre training of language models on large text corpora is common practice in
Natural Language Processing. Following, fine tuning of these models is
performed to achieve the best results on a variety of tasks. In this paper we
question the common practice of only adding a single output layer as a
classification head on top of the network. We perform an AutoML search to find
architectures that outperform the current single layer at only a small compute
cost. We validate our classification architecture on a variety of NLP
benchmarks from the GLUE dataset.
☆ Efficient Heatmap-Guided 6-Dof Grasp Detection in Cluttered Scenes
Fast and robust object grasping in clutter is a crucial component of
robotics. Most current works resort to the whole observed point cloud for 6-Dof
grasp generation, ignoring the guidance information excavated from global
semantics, thus limiting high-quality grasp generation and real-time
performance. In this work, we show that the widely used heatmaps are
underestimated in the efficiency of 6-Dof grasp generation. Therefore, we
propose an effective local grasp generator combined with grasp heatmaps as
guidance, which infers in a global-to-local semantic-to-point way.
Specifically, Gaussian encoding and the grid-based strategy are applied to
predict grasp heatmaps as guidance to aggregate local points into graspable
regions and provide global semantic information. Further, a novel non-uniform
anchor sampling mechanism is designed to improve grasp accuracy and diversity.
Benefiting from the high-efficiency encoding in the image space and focusing on
points in local graspable regions, our framework can perform high-quality grasp
detection in real-time and achieve state-of-the-art results. In addition, real
robot experiments demonstrate the effectiveness of our method with a success
rate of 94% and a clutter completion rate of 100%. Our code is available at
https://github.com/THU-VCLab/HGGD.
comment: Extensive results on GraspNet-1B dataset
☆ A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks
Axel Constant, Hannes Westermann, Bryan Wilson, Alex Kiefer, Ines Hipolito, Sylvain Pronovost, Steven Swanson, Mahault Albarracin, Maxwell J. D. Ramstead
Legal autonomy - the lawful activity of artificial intelligence agents - can
be achieved in one of two ways. It can be achieved either by imposing
constraints on AI actors such as developers, deployers and users, and on AI
resources such as data, or by imposing constraints on the range and scope of
the impact that AI agents can have on the environment. The latter approach
involves encoding extant rules concerning AI driven devices into the software
of AI agents controlling those devices (e.g., encoding rules about limitations
on zones of operations into the agent software of an autonomous drone device).
This is a challenge since the effectivity of such an approach requires a method
of extracting, loading, transforming and computing legal information that would
be both explainable and legally interoperable, and that would enable AI agents
to reason about the law. In this paper, we sketch a proof of principle for such
a method using large language models (LLMs), expert legal systems known as
legal decision paths, and Bayesian networks. We then show how the proposed
method could be applied to extant regulation in matters of autonomous cars,
such as the California Vehicle Code.
☆ A Novel Behavior-Based Recommendation System for E-commerce
Reza Barzegar Nozari, Mahdi Divsalar, Sepehr Akbarzadeh Abkenar, Mohammadreza Fadavi Amiri, Ali Divsalar
The majority of existing recommender systems rely on user ratings, which are
limited by the lack of user collaboration and the sparsity problem. To address
these issues, this study proposes a behavior-based recommender system that
leverages customers' natural behaviors, such as browsing and clicking, on
e-commerce platforms. The proposed recommendation system involves clustering
active customers, determining neighborhoods, collecting similar users,
calculating product reputation based on similar users, and recommending
high-reputation products. To overcome the complexity of customer behaviors and
traditional clustering methods, an unsupervised clustering approach based on
product categories is developed to enhance the recommendation methodology. This
study makes notable contributions in several aspects. Firstly, a groundbreaking
behavior-based recommendation methodology is developed, incorporating customer
behavior to generate accurate and tailored recommendations leading to improved
customer satisfaction and engagement. Secondly, an original unsupervised
clustering method, focusing on product categories, enables more precise
clustering and facilitates accurate recommendations. Finally, an approach to
determine neighborhoods for active customers within clusters is established,
ensuring grouping of customers with similar behavioral patterns to enhance
recommendation accuracy and relevance. The proposed recommendation methodology
and clustering method contribute to improved recommendation performance,
offering valuable insights for researchers and practitioners in the field of
e-commerce recommendation systems. Additionally, the proposed method
outperforms benchmark methods in experiments conducted using a behavior dataset
from the well-known e-commerce site Alibaba.
☆ Improving Line Search Methods for Large Scale Neural Network Training
In recent studies, line search methods have shown significant improvements in
the performance of traditional stochastic gradient descent techniques,
eliminating the need for a specific learning rate schedule. In this paper, we
identify existing issues in state-of-the-art line search methods, propose
enhancements, and rigorously evaluate their effectiveness. We test these
methods on larger datasets and more complex data domains than before.
Specifically, we improve the Armijo line search by integrating the momentum
term from ADAM in its search direction, enabling efficient large-scale
training, a task that was previously prone to failure using Armijo line search
methods. Our optimization approach outperforms both the previous Armijo
implementation and tuned learning rate schedules for Adam. Our evaluation
focuses on Transformers and CNNs in the domains of NLP and image data. Our work
is publicly available as a Python package, which provides a hyperparameter free
Pytorch optimizer.
☆ Faster Convergence for Transformer Fine-tuning with Line Search Methods
Recent works have shown that line search methods greatly increase performance
of traditional stochastic gradient descent methods on a variety of datasets and
architectures [1], [2]. In this work we succeed in extending line search
methods to the novel and highly popular Transformer architecture and dataset
domains in natural language processing. More specifically, we combine the
Armijo line search with the Adam optimizer and extend it by subdividing the
networks architecture into sensible units and perform the line search
separately on these local units. Our optimization method outperforms the
traditional Adam optimizer and achieves significant performance improvements
for small data sets or small training budgets, while performing equal or better
for other tested cases. Our work is publicly available as a python package,
which provides a hyperparameter-free pytorch optimizer that is compatible with
arbitrary network architectures.
☆ Impact of Employing Weather Forecast Data as Input to the Estimation of Evapotranspiration by Deep Neural Network Models SC
Reference Evapotranspiration (ET0) is a key parameter for designing smart
irrigation scheduling, since it is related by a coefficient to the water needs
of a crop. The United Nations Food and Agriculture Organization, proposed a
standard method for ET0 computation (FAO56PM), based on the parameterization of
the Penman-Monteith equation, that is widely adopted in the literature. To
compute ET0 using the FAO56-PM method, four main weather parameters are needed:
temperature, humidity, wind, and solar radiation (SR). One way to make daily
ET0 estimations for future days is to use freely available weather forecast
services (WFSs), where many meteorological parameters are estimated up to the
next 15 days. A problem with this method is that currently, SR is not provided
as a free forecast parameter on most of those online services or, normally,
such forecasts present a financial cost penalty. For this reason, several ET0
estimation models using machine and deep learning were developed and presented
in the literature, that use as input features a reduced set of carefully
selected weather parameters, that are compatible with common freely available
WFSs. However, most studies on this topic have only evaluated model performance
using data from weather stations (WSs), without considering the effect of using
weather forecast data. In this study, the performance of authors' previous
models is evaluated when using weather forecast data from two online WFSs, in
the following scenarios: (i) direct ET0 estimation by an ANN model, and (ii)
estimate SR by ANN model, and then use that estimation for ET0 computation,
using the FAO56-PM method. Employing data collected from two WFSs and a WS
located in Vale do Lobo, Portugal, the latter approach achieved the best
result, with a coefficient of determination (R2) ranging between 0.893 and
0.667, when considering forecasts up to 15 days.
comment: A partial version of the work submitted to ESRE/INTERNATIONAL
CONFERENCE ON ENVIRONMENTAL SCIENCES AND RENEWABLE ENERGY
☆ Synthesizing EEG Signals from Event-Related Potential Paradigms with Conditional Diffusion Models
Data scarcity in the brain-computer interface field can be alleviated through
the use of generative models, specifically diffusion models. While diffusion
models have previously been successfully applied to electroencephalogram (EEG)
data, existing models lack flexibility w.r.t.~sampling or require alternative
representations of the EEG data. To overcome these limitations, we introduce a
novel approach to conditional diffusion models that utilizes classifier-free
guidance to directly generate subject-, session-, and class-specific EEG data.
In addition to commonly used metrics, domain-specific metrics are employed to
evaluate the specificity of the generated samples. The results indicate that
the proposed model can generate EEG data that resembles real data for each
subject, session, and class.
comment: submitted to 9th Graz BCI conference, 6 pages, 3 figures, first
figure is split into two subfigures, 1 table
☆ Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds CVPR2024
3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to
annotating new domains. Self-training is a competitive approach for this task,
but its performance is limited by different sensor sampling patterns (i.e.,
variations in point density) and incomplete training strategies. In this work,
we propose a density-guided translator (DGT), which translates point density
between domains, and integrates it into a two-stage self-training pipeline
named DGT-ST. First, in contrast to existing works that simultaneously conduct
data generation and feature/output alignment within unstable adversarial
training, we employ the non-learnable DGT to bridge the domain gap at the input
level. Second, to provide a well-initialized model for self-training, we
propose a category-level adversarial network in stage one that utilizes the
prototype to prevent negative transfer. Finally, by leveraging the designs
above, a domain-mixed self-training method with source-aware consistency loss
is proposed in stage two to narrow the domain gap further. Experiments on two
synthetic-to-real segmentation tasks (SynLiDAR $\rightarrow$ semanticKITTI and
SynLiDAR $\rightarrow$ semanticPOSS) demonstrate that DGT-ST outperforms
state-of-the-art methods, achieving 9.4$\%$ and 4.3$\%$ mIoU improvements,
respectively. Code is available at \url{https://github.com/yuan-zm/DGT-ST}.
comment: CVPR2024
☆ CoBOS: Constraint-Based Online Scheduler for Human-Robot Collaboration
Assembly processes involving humans and robots are challenging scenarios
because the individual activities and access to shared workspace have to be
coordinated. Fixed robot programs leave no room to diverge from a fixed
protocol. Working on such a process can be stressful for the user and lead to
ineffective behavior or failure. We propose a novel approach of online
constraint-based scheduling in a reactive execution control framework
facilitating behavior trees called CoBOS. This allows the robot to adapt to
uncertain events such as delayed activity completions and activity selection
(by the human). The user will experience less stress as the robotic coworkers
adapt their behavior to best complement the human-selected activities to
complete the common task. In addition to the improved working conditions, our
algorithm leads to increased efficiency, even in highly uncertain scenarios. We
evaluate our algorithm using a probabilistic simulation study with 56000
experiments. We outperform all baselines by a margin of 4-10%. Initial real
robot experiments using a Franka Emika Panda robot and human tracking based on
HTC Vive VR gloves look promising.
comment: 7 pages, 8 figures
☆ CoRAST: Towards Foundation Model-Powered Correlated Data Analysis in Resource-Constrained CPS and IoT
Foundation models (FMs) emerge as a promising solution to harness distributed
and diverse environmental data by leveraging prior knowledge to understand the
complicated temporal and spatial correlations within heterogeneous datasets.
Unlike distributed learning frameworks such as federated learning, which often
struggle with multimodal data, FMs can transform diverse inputs into
embeddings. This process facilitates the integration of information from
various modalities and the application of prior learning to new domains.
However, deploying FMs in resource-constrained edge systems poses significant
challenges. To this end, we introduce CoRAST, a novel learning framework that
utilizes FMs for enhanced analysis of distributed, correlated heterogeneous
data. Utilizing a server-based FM, CoRAST can exploit existing environment
information to extract temporal, spatial, and cross-modal correlations among
sensor data. This enables CoRAST to offer context-aware insights for localized
client tasks through FM-powered global representation learning. Our evaluation
on real-world weather dataset demonstrates CoRAST's ability to exploit
correlated heterogeneous data through environmental representation learning to
reduce the forecast errors by up to 50.3% compared to the baselines.
comment: accepted and to be published in 2024 IEEE International Workshop on
Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys)
☆ U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models
Diffusion models have demonstrated remarkable performance in text-to-image
synthesis, producing realistic and high resolution images that faithfully
adhere to the corresponding text-prompts. Despite their great success, they
still fall behind in sketch-to-image synthesis tasks, where in addition to
text-prompts, the spatial layout of the generated images has to closely follow
the outlines of certain reference sketches. Employing an MLP latent edge
predictor to guide the spatial layout of the synthesized image by predicting
edge maps at each denoising step has been recently proposed. Despite yielding
promising results, the pixel-wise operation of the MLP does not take into
account the spatial layout as a whole, and demands numerous denoising
iterations to produce satisfactory images, leading to time inefficiency. To
this end, we introduce U-Sketch, a framework featuring a U-Net type latent edge
predictor, which is capable of efficiently capturing both local and global
features, as well as spatial correlations between pixels. Moreover, we propose
the addition of a sketch simplification network that offers the user the choice
of preprocessing and simplifying input sketches for enhanced outputs. The
experimental results, corroborated by user feedback, demonstrate that our
proposed U-Net latent edge predictor leads to more realistic results, that are
better aligned with the spatial outlines of the reference sketches, while
drastically reducing the number of required denoising steps and, consequently,
the overall execution time.
☆ BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning
Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance
on a wide variety of biomedical NLP tasks. However, these models have hundreds
of billions of parameters, are computationally expensive to run, require users
to send their input data over the internet, and are trained on unknown data
sources. Can smaller, more targeted models compete? To address this question,
we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive
model trained exclusively on PubMed abstracts and full articles. When
fine-tuned, BioMedLM can produce strong multiple-choice biomedical
question-answering results competitive with much larger models, such as
achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical
Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to
patient questions on medical topics. This demonstrates that smaller models can
potentially serve as transparent, privacy-preserving, economical and
environmentally friendly foundations for particular NLP applications, such as
in biomedicine. The model is available on the Hugging Face Hub:
https://huggingface.co/stanford-crfm/BioMedLM.
comment: 23 pages
☆ A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification
Semi-supervised learning (SSL) is a practical challenge in computer vision.
Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, obtain the State Of
The Art (SOTA) performances in SSL. These approaches employ a
threshold-to-pseudo-label (T2L) process to generate PLs by truncating the
confidence scores of unlabeled data predicted by the self-training method.
However, self-trained models typically yield biased and high-variance
predictions, especially in the scenarios when a little labeled data are
supplied. To address this issue, we propose a lightweight channel-based
ensemble method to effectively consolidate multiple inferior PLs into the
theoretically guaranteed unbiased and low-variance one. Importantly, our
approach can be readily extended to any SSL framework, such as FixMatch or
FreeMatch. Experimental results demonstrate that our method significantly
outperforms state-of-the-art techniques on CIFAR10/100 in terms of
effectiveness and efficiency.
☆ An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Stimulated by the sophisticated reasoning capabilities of recent Large
Language Models (LLMs), a variety of strategies for bridging video modality
have been devised. A prominent strategy involves Video Language Models
(VideoLMs), which train a learnable interface with video data to connect
advanced vision encoders with LLMs. Recently, an alternative strategy has
surfaced, employing readily available foundation models, such as VideoLMs and
LLMs, across multiple stages for modality bridging. In this study, we introduce
a simple yet novel strategy where only a single Vision Language Model (VLM) is
utilized. Our starting point is the plain insight that a video comprises a
series of images, or frames, interwoven with temporal information. The essence
of video comprehension lies in adeptly managing the temporal aspects along with
the spatial details of each frame. Initially, we transform a video into a
single composite image by arranging multiple frames in a grid layout. The
resulting single image is termed as an image grid. This format, while
maintaining the appearance of a solitary image, effectively retains temporal
information within the grid structure. Therefore, the image grid approach
enables direct application of a single high-performance VLM without
necessitating any video-data training. Our extensive experimental analysis
across ten zero-shot video question answering benchmarks, including five
open-ended and five multiple-choice benchmarks, reveals that the proposed Image
Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out
of ten benchmarks.
comment: Our code is available at https://github.com/imagegridworth/IG-VLM
☆ Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval
Collecting relevant judgments for legal case retrieval is a challenging and
time-consuming task. Accurately judging the relevance between two legal cases
requires a considerable effort to read the lengthy text and a high level of
domain expertise to extract Legal Facts and make juridical judgments. With the
advent of advanced large language models, some recent studies have suggested
that it is promising to use LLMs for relevance judgment. Nonetheless, the
method of employing a general large language model for reliable relevance
judgments in legal case retrieval is yet to be thoroughly explored. To fill
this research gap, we devise a novel few-shot workflow tailored to the relevant
judgment of legal cases. The proposed workflow breaks down the annotation
process into a series of stages, imitating the process employed by human
annotators and enabling a flexible integration of expert reasoning to enhance
the accuracy of relevance judgments. By comparing the relevance judgments of
LLMs and human experts, we empirically show that we can obtain reliable
relevance judgments with the proposed workflow. Furthermore, we demonstrate the
capacity to augment existing legal case retrieval models through the synthesis
of data generated by the large language model.
☆ Colour and Brush Stroke Pattern Recognition in Abstract Art using Modified Deep Convolutional Generative Adversarial Networks
Abstract Art is an immensely popular, discussed form of art that often has
the ability to depict the emotions of an artist. Many researchers have made
attempts to study abstract art in the form of edge detection, brush stroke and
emotion recognition algorithms using machine and deep learning. This papers
describes the study of a wide distribution of abstract paintings using
Generative Adversarial Neural Networks(GAN). GANs have the ability to learn and
reproduce a distribution enabling researchers and scientists to effectively
explore and study the generated image space. However, the challenge lies in
developing an efficient GAN architecture that overcomes common training
pitfalls. This paper addresses this challenge by introducing a modified-DCGAN
(mDCGAN) specifically designed for high-quality artwork generation. The
approach involves a thorough exploration of the modifications made, delving
into the intricate workings of DCGANs, optimisation techniques, and
regularisation methods aimed at improving stability and realism in art
generation enabling effective study of generated patterns. The proposed mDCGAN
incorporates meticulous adjustments in layer configurations and architectural
choices, offering tailored solutions to the unique demands of art generation
while effectively combating issues like mode collapse and gradient vanishing.
Further this paper explores the generated latent space by performing random
walks to understand vector relationships between brush strokes and colours in
the abstract art space and a statistical analysis of unstable outputs after a
certain period of GAN training and compare its significant difference. These
findings validate the effectiveness of the proposed approach, emphasising its
potential to revolutionise the field of digital art generation and digital art
ecosystem.
comment: 28 pages, 5 tables, 7 figures
☆ FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
Spiking Neural Networks (SNNs) offer a promising avenue for energy-efficient
computing compared with Artificial Neural Networks (ANNs), closely mirroring
biological neural processes. However, this potential comes with inherent
challenges in directly training SNNs through spatio-temporal backpropagation --
stemming from the temporal dynamics of spiking neurons and their discrete
signal processing -- which necessitates alternative ways of training, most
notably through ANN-SNN conversion. In this work, we introduce a lightweight
Forward Temporal Bias Correction (FTBC) technique, aimed at enhancing
conversion accuracy without the computational overhead. We ground our method on
provided theoretical findings that through proper temporal bias calibration the
expected error of ANN-SNN conversion can be reduced to be zero after each time
step. We further propose a heuristic algorithm for finding the temporal bias
only in the forward pass, thus eliminating the computational burden of
backpropagation and we evaluate our method on CIFAR-10/100 and ImageNet
datasets, achieving a notable increase in accuracy on all datasets. Codes are
released at a GitHub repository.
☆ Generative Multi-modal Models are Good Class-Incremental Learners CVPR 2024
In class-incremental learning (CIL) scenarios, the phenomenon of catastrophic
forgetting caused by the classifier's bias towards the current task has long
posed a significant challenge. It is mainly caused by the characteristic of
discriminative models. With the growing popularity of the generative
multi-modal models, we would explore replacing discriminative models with
generative ones for CIL. However, transitioning from discriminative to
generative models requires addressing two key challenges. The primary challenge
lies in transferring the generated textual information into the classification
of distinct categories. Additionally, it requires formulating the task of CIL
within a generative framework. To this end, we propose a novel generative
multi-modal model (GMM) framework for class-incremental learning. Our approach
directly generates labels for images using an adapted generative model. After
obtaining the detailed text, we use a text encoder to extract text features and
employ feature matching to determine the most similar label as the
classification prediction. In the conventional CIL settings, we achieve
significantly better results in long-sequence task scenarios. Under the
Few-shot CIL setting, we have improved by at least 14\% accuracy over all the
current state-of-the-art methods with significantly less forgetting. Our code
is available at \url{https://github.com/DoubleClass/GMM}.
comment: Accepted at CVPR 2024
☆ Improving Attributed Text Generation of Large Language Models via Preference Learning
Large language models have been widely adopted in natural language
processing, yet they face the challenge of generating unreliable content.
Recent works aim to reduce misinformation and hallucinations by resorting to
attribution as a means to provide evidence (i.e., citations). However, current
attribution methods usually focus on the retrieval stage and automatic
evaluation that neglect mirroring the citation mechanisms in human scholarly
writing to bolster credibility. In this paper, we address these challenges by
modelling the attribution task as preference learning and introducing an
Automatic Preference Optimization (APO) framework. First, we create a curated
collection for post-training with 6,330 examples by collecting and filtering
from existing datasets. Second, considering the high cost of labelling
preference data, we further propose an automatic method to synthesize
attribution preference data resulting in 95,263 pairs. Moreover, inspired by
the human citation process, we further propose a progressive preference
optimization method by leveraging fine-grained information. Extensive
experiments on three datasets (i.e., ASQA, StrategyQA, and ELI5) demonstrate
that APO achieves state-of-the-art citation F1 with higher answer quality.
comment: 23 pages, 15 tables, 2 figures
☆ IIP-Mixer:Intra-Inter Patch Mixing Architecture for Battery Remaining Useful Life Prediction
Accurately estimating the Remaining Useful Life (RUL) of lithium-ion
batteries is crucial for maintaining the safe and stable operation of
rechargeable battery management systems. However, this task is often
challenging due to the complex temporal dynamics involved. Recently,
attention-based networks, such as Transformers and Informer, have been the
popular architecture in time series forecasting. Despite their effectiveness,
these models with abundant parameters necessitate substantial training time to
unravel temporal patterns. To tackle these challenges, we propose a simple
MLP-Mixer-based architecture named 'Intra-Inter Patch Mixer' (IIP-Mixer), which
is an architecture based exclusively on multi-layer perceptrons (MLPs),
extracting information by mixing operations along both intra-patch and
inter-patch dimensions for battery RUL prediction. The proposed IIP-Mixer
comprises parallel dual-head mixer layers: the intra-patch mixing MLP,
capturing local temporal patterns in the short-term period, and the inter-patch
mixing MLP, capturing global temporal patterns in the long-term period.
Notably, to address the varying importance of features in RUL prediction, we
introduce a weighted loss function in the MLP-Mixer-based architecture, marking
the first time such an approach has been employed. Our experiments demonstrate
that IIP-Mixer achieves competitive performance in battery RUL prediction,
outperforming other popular time-series frameworks
☆ Intent-Aware DRL-Based Uplink Dynamic Scheduler for 5G-NR
We investigate the problem of supporting Industrial Internet of Things user
equipment (IIoT UEs) with intent (i.e., requested quality of service (QoS)) and
random traffic arrival. A deep reinforcement learning (DRL) based centralized
dynamic scheduler for time-frequency resources is proposed to learn how to
schedule the available communication resources among the IIoT UEs. The proposed
scheduler leverages an RL framework to adapt to the dynamic changes in the
wireless communication system and traffic arrivals. Moreover, a graph-based
reduction scheme is proposed to reduce the state and action space of the RL
framework to allow fast convergence and a better learning strategy. Simulation
results demonstrate the effectiveness of the proposed intelligent scheduler in
guaranteeing the expressed intent of IIoT UEs compared to several traditional
scheduling schemes, such as round-robin, semi-static, and heuristic approaches.
The proposed scheduler also outperforms the contention-free and
contention-based schemes in maximizing the number of successfully computed
tasks.
☆ Generating Diverse Agricultural Data for Vision-Based Farming Applications
Mikolaj Cieslak, Umabharathi Govindarajan, Alejandro Garcia, Anuradha Chandrashekar, Torsten Hädrich, Aleksander Mendoza-Drosik, Dominik L. Michels, Sören Pirk, Chia-Chun Fu, Wojciech Pałubicki
We present a specialized procedural model for generating synthetic
agricultural scenes, focusing on soybean crops, along with various weeds. This
model is capable of simulating distinct growth stages of these plants, diverse
soil conditions, and randomized field arrangements under varying lighting
conditions. The integration of real-world textures and environmental factors
into the procedural generation process enhances the photorealism and
applicability of the synthetic data. Our dataset includes 12,000 images with
semantic labels, offering a comprehensive resource for computer vision tasks in
precision agriculture, such as semantic segmentation for autonomous weed
control. We validate our model's effectiveness by comparing the synthetic data
against real agricultural images, demonstrating its potential to significantly
augment training data for machine learning models in agriculture. This approach
not only provides a cost-effective solution for generating high-quality,
diverse data but also addresses specific needs in agricultural vision tasks
that are not fully covered by general-purpose models.
comment: 10 pages, 8 figures, 3 tables
☆ A Quantum Fuzzy-based Approach for Real-Time Detection of Solar Coronal Holes
The detection and analysis of the solar coronal holes (CHs) is an important
field of study in the domain of solar physics. Mainly, it is required for the
proper prediction of the geomagnetic storms which directly or indirectly affect
various space and ground-based systems. For the detection of CHs till date, the
solar scientist depends on manual hand-drawn approaches. However, with the
advancement of image processing technologies, some automated image segmentation
methods have been used for the detection of CHs. In-spite of this, fast and
accurate detection of CHs are till a major issues. Here in this work, a novel
quantum computing-based fast fuzzy c-mean technique has been developed for fast
detection of the CHs region. The task has been carried out in two stages, in
first stage the solar image has been segmented using a quantum computing based
fast fuzzy c-mean (QCFFCM) and in the later stage the CHs has been extracted
out from the segmented image based on image morphological operation. In the
work, quantum computing has been used to optimize the cost function of the fast
fuzzy c-mean (FFCM) algorithm, where quantum approximate optimization algorithm
(QAOA) has been used to optimize the quadratic part of the cost function. The
proposed method has been tested for 193 \AA{} SDO/AIA full-disk solar image
datasets and has been compared with the existing techniques. The outcome shows
the comparable performance of the proposed method with the existing one within
a very lesser time.
comment: 14 pages, 5 figures, 3 tables
☆ LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models
To ensure safe driving in dynamic environments, autonomous vehicles should
possess the capability to accurately predict the lane change intentions of
surrounding vehicles in advance and forecast their future trajectories.
Existing motion prediction approaches have ample room for improvement,
particularly in terms of long-term prediction accuracy and interpretability. In
this paper, we address these challenges by proposing LC-LLM, an explainable
lane change prediction model that leverages the strong reasoning capabilities
and self-explanation abilities of Large Language Models (LLMs). Essentially, we
reformulate the lane change prediction task as a language modeling problem,
processing heterogeneous driving scenario information in natural language as
prompts for input into the LLM and employing a supervised fine-tuning technique
to tailor the LLM specifically for our lane change prediction task. This allows
us to utilize the LLM's powerful common sense reasoning abilities to understand
complex interactive information, thereby improving the accuracy of long-term
predictions. Furthermore, we incorporate explanatory requirements into the
prompts in the inference stage. Therefore, our LC-LLM model not only can
predict lane change intentions and trajectories but also provides explanations
for its predictions, enhancing the interpretability. Extensive experiments on
the large-scale highD dataset demonstrate the superior performance and
interpretability of our LC-LLM in lane change prediction task. To the best of
our knowledge, this is the first attempt to utilize LLMs for predicting lane
change behavior. Our study shows that LLMs can encode comprehensive interaction
information for driving behavior understanding.
☆ mALBERT: Is a Compact Multilingual BERT Model Still Worth It?
Within the current trend of Pretained Language Models (PLM), emerge more and
more criticisms about the ethical andecological impact of such models. In this
article, considering these critical remarks, we propose to focus on
smallermodels, such as compact models like ALBERT, which are more ecologically
virtuous than these PLM. However,PLMs enable huge breakthroughs in Natural
Language Processing tasks, such as Spoken and Natural LanguageUnderstanding,
classification, Question--Answering tasks. PLMs also have the advantage of
being multilingual, and,as far as we know, a multilingual version of compact
ALBERT models does not exist. Considering these facts, wepropose the free
release of the first version of a multilingual compact ALBERT model,
pre-trained using Wikipediadata, which complies with the ethical aspect of such
a language model. We also evaluate the model against classicalmultilingual PLMs
in classical NLP tasks. Finally, this paper proposes a rare study on the
subword tokenizationimpact on language performances.
comment: The 2024 Joint International Conference on Computational Linguistics,
Language Resources and Evaluation, May 2024, Torino, Italy
☆ Can LLMs Converse Formally? Automatically Assessing LLMs in Translating and Interpreting Formal Specifications
Stakeholders often describe system requirements using natural language which
are then converted to formal syntax by a domain-expert leading to increased
design costs. This paper assesses the capabilities of Large Language Models
(LLMs) in converting between natural language descriptions and formal
specifications. Existing work has evaluated the capabilities of LLMs in
generating formal syntax such as source code but such experiments are typically
hand-crafted and use problems that are likely to be in the training set of
LLMs, and often require human-annotated datasets. We propose an approach that
can use two copies of an LLM in conjunction with an off-the-shelf verifier to
automatically evaluate its translation abilities without any additional human
input. Our approach generates formal syntax using language grammars to
automatically generate a dataset. We conduct an empirical evaluation to measure
the accuracy of this translation task and show that SOTA LLMs cannot adequately
solve this task, limiting their current utility in the design of complex
systems.
☆ Chinese Offensive Language Detection:Current Status and Future Directions
Despite the considerable efforts being made to monitor and regulate
user-generated content on social media platforms, the pervasiveness of
offensive language, such as hate speech or cyberbullying, in the digital space
remains a significant challenge. Given the importance of maintaining a
civilized and respectful online environment, there is an urgent and growing
need for automatic systems capable of detecting offensive speech in real time.
However, developing effective systems for processing languages such as Chinese
presents a significant challenge, owing to the language's complex and nuanced
nature, which makes it difficult to process automatically. This paper provides
a comprehensive overview of offensive language detection in Chinese, examining
current benchmarks and approaches and highlighting specific models and tools
for addressing the unique challenges of detecting offensive language in this
complex language. The primary objective of this survey is to explore the
existing techniques and identify potential avenues for further research that
can address the cultural and linguistic complexities of Chinese.
☆ A thermodynamically consistent physics-informed deep learning material model for short fiber/polymer nanocomposites
This work proposes a physics-informed deep learning (PIDL)-based constitutive
model for investigating the viscoelastic-viscoplastic behavior of short
fiber-reinforced nanoparticle-filled epoxies under various ambient conditions.
The deep-learning model is trained to enforce thermodynamic principles, leading
to a thermodynamically consistent constitutive model. To accomplish this, a
long short-term memory network is combined with a feed-forward neural network
to predict internal variables required for characterizing the internal
dissipation of the nanocomposite materials. In addition, another feed-forward
neural network is used to indicate the free-energy function, which enables
defining the thermodynamic state of the entire system. The PIDL model is
initially developed for the three-dimensional case by generating synthetic data
from a classical constitutive model. The model is then trained by extracting
the data directly from cyclic loading-unloading experimental tests. Numerical
examples show that the PIDL model can accurately predict the mechanical
behavior of epoxy-based nanocomposites for different volume fractions of fibers
and nanoparticles under various hygrothermal conditions.
comment: arXiv admin note: text overlap with arXiv:2305.08102
☆ A Recommender System for NFT Collectibles with Item Feature AAAI 2023
Recommender systems have been actively studied and applied in various domains
to deal with information overload. Although there are numerous studies on
recommender systems for movies, music, and e-commerce, comparatively less
attention has been paid to the recommender system for NFTs despite the
continuous growth of the NFT market. This paper presents a recommender system
for NFTs that utilizes a variety of data sources, from NFT transaction records
to external item features, to generate precise recommendations that cater to
individual preferences. We develop a data-efficient graph-based recommender
system to efficiently capture the complex relationship between each item and
users and generate node(item) embeddings which incorporate both node feature
information and graph structure. Furthermore, we exploit inputs beyond
user-item interactions, such as image feature, text feature, and price feature.
Numerical experiments verify the performance of the graph-based recommender
system improves significantly after utilizing all types of item features as
side information, thereby outperforming all other baselines.
comment: Presented at the AAAI 2023 Bridge on AI for Financial Services
(https://sites.google.com/view/aaai-ai-fin/home)
☆ Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives ICLR 2024
Shrinivas Ramasubramanian, Harsh Rangwani, Sho Takemori, Kunal Samanta, Yuhei Umeda, Venkatesh Babu Radhakrishnan
The rise in internet usage has led to the generation of massive amounts of
data, resulting in the adoption of various supervised and semi-supervised
machine learning algorithms, which can effectively utilize the colossal amount
of data to train models. However, before deploying these models in the real
world, these must be strictly evaluated on performance measures like worst-case
recall and satisfy constraints such as fairness. We find that current
state-of-the-art empirical techniques offer sub-optimal performance on these
practical, non-decomposable performance objectives. On the other hand, the
theoretical techniques necessitate training a new model from scratch for each
performance objective. To bridge the gap, we propose SelMix, a selective
mixup-based inexpensive fine-tuning technique for pre-trained models, to
optimize for the desired objective. The core idea of our framework is to
determine a sampling distribution to perform a mixup of features between
samples from particular classes such that it optimizes the given objective. We
comprehensively evaluate our technique against the existing empirical and
theoretically principled methods on standard benchmark datasets for imbalanced
classification. We find that proposed SelMix fine-tuning significantly improves
the performance for various practical non-decomposable objectives across
benchmarks.
comment: ICLR 2024 SpotLight
☆ GeNet: A Graph Neural Network-based Anti-noise Task-Oriented Semantic Communication Paradigm
Traditional approaches to semantic communication tasks rely on the knowledge
of the signal-to-noise ratio (SNR) to mitigate channel noise. However, these
methods necessitate training under specific SNR conditions, entailing
considerable time and computational resources. In this paper, we propose GeNet,
a Graph Neural Network (GNN)-based paradigm for semantic communication aimed at
combating noise, thereby facilitating Task-Oriented Communication (TOC). We
propose a novel approach where we first transform the input data image into
graph structures. Then we leverage a GNN-based encoder to extract semantic
information from the source data. This extracted semantic information is then
transmitted through the channel. At the receiver's end, a GNN-based decoder is
utilized to reconstruct the relevant semantic information from the source data
for TOC. Through experimental evaluation, we show GeNet's effectiveness in
anti-noise TOC while decoupling the SNR dependency. We further evaluate GeNet's
performance by varying the number of nodes, revealing its versatility as a new
paradigm for semantic communication. Additionally, we show GeNet's robustness
to geometric transformations by testing it with different rotation angles,
without resorting to data augmentation.
☆ Few-Shot Recalibration of Language Models
Recent work has uncovered promising ways to extract well-calibrated
confidence estimates from language models (LMs), where the model's confidence
score reflects how likely it is to be correct. However, while LMs may appear
well-calibrated over broad distributions, this often hides significant
miscalibration within narrower slices (e.g., systemic over-confidence in math
can balance out systemic under-confidence in history, yielding perfect
calibration in aggregate). To attain well-calibrated confidence estimates for
any slice of a distribution, we propose a new framework for few-shot
slice-specific recalibration. Specifically, we train a recalibration model that
takes in a few unlabeled examples from any given slice and predicts a curve
that remaps confidence scores to be more accurate for that slice. Our trained
model can recalibrate for arbitrary new slices, without using any labeled data
from that slice. This enables us to identify domain-specific confidence
thresholds above which the LM's predictions can be trusted, and below which it
should abstain. Experiments show that our few-shot recalibrator consistently
outperforms existing calibration methods, for instance improving calibration
error for PaLM2-Large on MMLU by 16%, as compared to temperature scaling.
comment: preprint
☆ Identification and Uses of Deep Learning Backbones via Pattern Mining SDM24
Deep learning is extensively used in many areas of data mining as a black-box
method with impressive results. However, understanding the core mechanism of
how deep learning makes predictions is a relatively understudied problem. Here
we explore the notion of identifying a backbone of deep learning for a given
group of instances. A group here can be instances of the same class or even
misclassified instances of the same class. We view each instance for a given
group as activating a subset of neurons and attempt to find a subgraph of
neurons associated with a given concept/group. We formulate this problem as a
set cover style problem and show it is intractable and presents a highly
constrained integer linear programming (ILP) formulation. As an alternative, we
explore a coverage-based heuristic approach related to pattern mining, and show
it converges to a Pareto equilibrium point of the ILP formulation.
Experimentally we explore these backbones to identify mistakes and improve
performance, explanation, and visualization. We demonstrate application-based
results using several challenging data sets, including Bird Audio Detection
(BAD) Challenge and Labeled Faces in the Wild (LFW), as well as the classic
MNIST data.
comment: 9 pages, 6 figures, published SIAM SDM24
☆ DSF-GAN: DownStream Feedback Generative Adversarial Network
Utility and privacy are two crucial measurements of the quality of synthetic
tabular data. While significant advancements have been made in privacy
measures, generating synthetic samples with high utility remains challenging.
To enhance the utility of synthetic samples, we propose a novel architecture
called the DownStream Feedback Generative Adversarial Network (DSF-GAN). This
approach incorporates feedback from a downstream prediction model during
training to augment the generator's loss function with valuable information.
Thus, DSF-GAN utilizes a downstream prediction task to enhance the utility of
synthetic samples. To evaluate our method, we tested it using two popular
datasets. Our experiments demonstrate improved model performance when training
on synthetic samples generated by DSF-GAN, compared to those generated by the
same GAN architecture without feedback. The evaluation was conducted on the
same validation set comprising real samples. All code and datasets used in this
research will be made openly available for ease of reproduction.
☆ Enhancing Generative Class Incremental Learning Performance with Model Forgetting Approach
This study presents a novel approach to Generative Class Incremental Learning
(GCIL) by introducing the forgetting mechanism, aimed at dynamically managing
class information for better adaptation to streaming data. GCIL is one of the
hot topics in the field of computer vision, and this is considered one of the
crucial tasks in society, specifically the continual learning of generative
models. The ability to forget is a crucial brain function that facilitates
continual learning by selectively discarding less relevant information for
humans. However, in the field of machine learning models, the concept of
intentionally forgetting has not been extensively investigated. In this study
we aim to bridge this gap by incorporating the forgetting mechanisms into GCIL,
thereby examining their impact on the models' ability to learn in continual
learning. Through our experiments, we have found that integrating the
forgetting mechanisms significantly enhances the models' performance in
acquiring new knowledge, underscoring the positive role that strategic
forgetting plays in the process of continual learning.
☆ Manipulating Neural Path Planners via Slight Perturbations
Data-driven neural path planners are attracting increasing interest in the
robotics community. However, their neural network components typically come as
black boxes, obscuring their underlying decision-making processes. Their
black-box nature exposes them to the risk of being compromised via the
insertion of hidden malicious behaviors. For example, an attacker may hide
behaviors that, when triggered, hijack a delivery robot by guiding it to a
specific (albeit wrong) destination, trapping it in a predefined region, or
inducing unnecessary energy expenditure by causing the robot to repeatedly
circle a region. In this paper, we propose a novel approach to specify and
inject a range of hidden malicious behaviors, known as backdoors, into neural
path planners. Our approach provides a concise but flexible way to define these
behaviors, and we show that hidden behaviors can be triggered by slight
perturbations (e.g., inserting a tiny unnoticeable object), that can
nonetheless significantly compromise their integrity. We also discuss potential
techniques to identify these backdoors aimed at alleviating such risks. We
demonstrate our approach on both sampling-based and search-based neural path
planners.
☆ Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
Visual representation learning has been a cornerstone in computer vision,
evolving from supervised learning with human-annotated labels to aligning
image-text pairs from the Internet. Despite recent advancements in multi-modal
large language models (MLLMs), the visual representations they rely on, such as
CLIP embeddings, often lack access to external world knowledge critical for
real-world visual reasoning. In this work, we propose Visual Table, a novel
visual representation tailored for MLLMs. It provides hierarchical text
descriptions of holistic visual scenes, consisting of a scene description and
multiple object-centric descriptions that encompass categories, attributes, and
knowledge at instance level. We further develop a scalable generator for visual
table generation and train it on small-scale annotations from GPT4V. Extensive
evaluations demonstrate that, with generated visual tables as additional visual
representations, our model can consistently outperform the state-of-the-art
(SOTA) MLLMs across diverse benchmarks. When visual tables serve as standalone
visual representations, our model can closely match or even beat the SOTA MLLMs
that are built on CLIP visual embeddings. Our code is available at
https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
☆ Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check
Retrieval-Augmented Generation (RAG) aims to generate more reliable and
accurate responses, by augmenting large language models (LLMs) with the
external vast and dynamic knowledge. Most previous work focuses on using RAG
for single-round question answering, while how to adapt RAG to the complex
conversational setting wherein the question is interdependent on the preceding
context is not well studied. In this paper, we propose a conversation-level RAG
approach, which incorporates fine-grained retrieval augmentation and self-check
for conversational question answering (CQA). In particular, our approach
consists of three components, namely conversational question refiner,
fine-grained retriever and self-check based response generator, which work
collaboratively for question understanding and relevant information acquisition
in conversational settings. Extensive experiments demonstrate the great
advantages of our approach over the state-of-the-art baselines. Moreover, we
also release a Chinese CQA dataset with new features including reformulated
question, extracted keyword, retrieved paragraphs and their helpfulness, which
facilitates further researches in RAG enhanced CQA.
☆ NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation
Ruikai Cui, Weizhe Liu, Weixuan Sun, Senbo Wang, Taizhang Shang, Yang Li, Xibin Song, Han Yan, Zhennan Wu, Shenzhou Chen, Hongdong Li, Pan Ji
3D shape generation aims to produce innovative 3D content adhering to
specific conditions and constraints. Existing methods often decompose 3D shapes
into a sequence of localized components, treating each element in isolation
without considering spatial consistency. As a result, these approaches exhibit
limited versatility in 3D data representation and shape generation, hindering
their ability to generate highly diverse 3D shapes that comply with the
specified constraints. In this paper, we introduce a novel spatial-aware 3D
shape generation framework that leverages 2D plane representations for enhanced
3D shape modeling. To ensure spatial coherence and reduce memory usage, we
incorporate a hybrid shape representation technique that directly learns a
continuous signed distance field representation of the 3D shape using
orthogonal 2D planes. Additionally, we meticulously enforce spatial
correspondences across distinct planes using a transformer-based autoencoder
structure, promoting the preservation of spatial relationships in the generated
3D shapes. This yields an algorithm that consistently outperforms
state-of-the-art 3D shape generation methods on various tasks, including
unconditional shape generation, multi-modal shape completion, single-view
reconstruction, and text-to-shape synthesis.
☆ Large Language Models Need Consultants for Reasoning: Becoming an Expert in a Complex Human System Through Behavior Simulation
Large language models (LLMs), in conjunction with various reasoning
reinforcement methodologies, have demonstrated remarkable capabilities
comparable to humans in fields such as mathematics, law, coding, common sense,
and world knowledge. In this paper, we delve into the reasoning abilities of
LLMs within complex human systems. We propose a novel reasoning framework,
termed ``Mosaic Expert Observation Wall'' (MEOW) exploiting
generative-agents-based simulation technique. In the MEOW framework, simulated
data are utilized to train an expert model concentrating ``experience'' about a
specific task in each independent time of simulation. It is the accumulated
``experience'' through the simulation that makes for an expert on a task in a
complex human system. We conduct the experiments within a communication game
that mirrors real-world security scenarios. The results indicate that our
proposed methodology can cooperate with existing methodologies to enhance the
reasoning abilities of LLMs in complex human systems.
☆ A Transformer-Based Framework for Payload Malware Detection and Classification
As malicious cyber threats become more sophisticated in breaching computer
networks, the need for effective intrusion detection systems (IDSs) becomes
crucial. Techniques such as Deep Packet Inspection (DPI) have been introduced
to allow IDSs analyze the content of network packets, providing more context
for identifying potential threats. IDSs traditionally rely on using
anomaly-based and signature-based detection techniques to detect unrecognized
and suspicious activity. Deep learning techniques have shown great potential in
DPI for IDSs due to their efficiency in learning intricate patterns from the
packet content being transmitted through the network. In this paper, we propose
a revolutionary DPI algorithm based on transformers adapted for the purpose of
detecting malicious traffic with a classifier head. Transformers learn the
complex content of sequence data and generalize them well to similar scenarios
thanks to their self-attention mechanism. Our proposed method uses the raw
payload bytes that represent the packet contents and is deployed as
man-in-the-middle. The payload bytes are used to detect malicious packets and
classify their types. Experimental results on the UNSW-NB15 and CIC-IOT23
datasets demonstrate that our transformer-based model is effective in
distinguishing malicious from benign traffic in the test dataset, attaining an
average accuracy of 79\% using binary classification and 72\% on the
multi-classification experiment, both using solely payload bytes.
☆ From Two-Dimensional to Three-Dimensional Environment with Q-Learning: Modeling Autonomous Navigation with Reinforcement Learning and no Libraries
Reinforcement learning (RL) algorithms have become indispensable tools in
artificial intelligence, empowering agents to acquire optimal decision-making
policies through interactions with their environment and feedback mechanisms.
This study explores the performance of RL agents in both two-dimensional (2D)
and three-dimensional (3D) environments, aiming to research the dynamics of
learning across different spatial dimensions. A key aspect of this
investigation is the absence of pre-made libraries for learning, with the
algorithm developed exclusively through computational mathematics. The
methodological framework centers on RL principles, employing a Q-learning agent
class and distinct environment classes tailored to each spatial dimension. The
research aims to address the question: How do reinforcement learning agents
adapt and perform in environments of varying spatial dimensions, particularly
in 2D and 3D settings? Through empirical analysis, the study evaluates agents'
learning trajectories and adaptation processes, revealing insights into the
efficacy of RL algorithms in navigating complex, multi-dimensional spaces.
Reflections on the findings prompt considerations for future research,
particularly in understanding the dynamics of learning in higher-dimensional
environments.
☆ Leveraging Large Language Models for Fuzzy String Matching in Political Science
Fuzzy string matching remains a key issue when political scientists combine
data from different sources. Existing matching methods invariably rely on
string distances, such as Levenshtein distance and cosine similarity. As such,
they are inherently incapable of matching strings that refer to the same entity
with different names such as ''JP Morgan'' and ''Chase Bank'', ''DPRK'' and
''North Korea'', ''Chuck Fleischmann (R)'' and ''Charles Fleischmann (R)''. In
this letter, we propose to use large language models to entirely sidestep this
problem in an easy and intuitive manner. Extensive experiments show that our
proposed methods can improve the state of the art by as much as 39% in terms of
average precision while being substantially easier and more intuitive to use by
political scientists. Moreover, our results are robust against various
temperatures. We further note that enhanced prompting can lead to additional
performance improvements.
comment: 7 pages, 2 figures, 1 table;
☆ Preference-Based Planning in Stochastic Environments: From Partially-Ordered Temporal Goals to Most Preferred Policies
Human preferences are not always represented via complete linear orders: It
is natural to employ partially-ordered preferences for expressing incomparable
outcomes. In this work, we consider decision-making and probabilistic planning
in stochastic systems modeled as Markov decision processes (MDPs), given a
partially ordered preference over a set of temporally extended goals.
Specifically, each temporally extended goal is expressed using a formula in
Linear Temporal Logic on Finite Traces (LTL$_f$). To plan with the partially
ordered preference, we introduce order theory to map a preference over temporal
goals to a preference over policies for the MDP. Accordingly, a most preferred
policy under a stochastic ordering induces a stochastic nondominated
probability distribution over the finite paths in the MDP. To synthesize a most
preferred policy, our technical approach includes two key steps. In the first
step, we develop a procedure to transform a partially ordered preference over
temporal goals into a computational model, called preference automaton, which
is a semi-automaton with a partial order over acceptance conditions. In the
second step, we prove that finding a most preferred policy is equivalent to
computing a Pareto-optimal policy in a multi-objective MDP that is constructed
from the original MDP, the preference automaton, and the chosen stochastic
ordering relation. Throughout the paper, we employ running examples to
illustrate the proposed preference specification and solution approaches. We
demonstrate the efficacy of our algorithm using these examples, providing
detailed analysis, and then discuss several potential future directions.
comment: arXiv admin note: substantial text overlap with arXiv:2209.12267
☆ Long and Short-Term Constraints Driven Safe Reinforcement Learning for Autonomous Driving
Reinforcement learning (RL) has been widely used in decision-making tasks,
but it cannot guarantee the agent's safety in the training process due to the
requirements of interaction with the environment, which seriously limits its
industrial applications such as autonomous driving. Safe RL methods are
developed to handle this issue by constraining the expected safety violation
costs as a training objective, but they still permit unsafe state occurrence,
which is unacceptable in autonomous driving tasks. Moreover, these methods are
difficult to achieve a balance between the cost and return expectations, which
leads to learning performance degradation for the algorithms. In this paper, we
propose a novel algorithm based on the long and short-term constraints (LSTC)
for safe RL. The short-term constraint aims to guarantee the short-term state
safety that the vehicle explores, while the long-term constraint ensures the
overall safety of the vehicle throughout the decision-making process. In
addition, we develop a safe RL method with dual-constraint optimization based
on the Lagrange multiplier to optimize the training process for end-to-end
autonomous driving. Comprehensive experiments were conducted on the MetaDrive
simulator. Experimental results demonstrate that the proposed method achieves
higher safety in continuous state and action tasks, and exhibits higher
exploration performance in long-distance decision-making tasks compared with
state-of-the-art methods.
☆ An Evolutionary Network Architecture Search Framework with Adaptive Multimodal Fusion for Hand Gesture Recognition
Hand gesture recognition (HGR) based on multimodal data has attracted
considerable attention owing to its great potential in applications. Various
manually designed multimodal deep networks have performed well in multimodal
HGR (MHGR), but most of existing algorithms require a lot of expert experience
and time-consuming manual trials. To address these issues, we propose an
evolutionary network architecture search framework with the adaptive multimodel
fusion (AMF-ENAS). Specifically, we design an encoding space that
simultaneously considers fusion positions and ratios of the multimodal data,
allowing for the automatic construction of multimodal networks with different
architectures through decoding. Additionally, we consider three input streams
corresponding to intra-modal surface electromyography (sEMG), intra-modal
accelerometer (ACC), and inter-modal sEMG-ACC. To automatically adapt to
various datasets, the ENAS framework is designed to automatically search a MHGR
network with appropriate fusion positions and ratios. To the best of our
knowledge, this is the first time that ENAS has been utilized in MHGR to tackle
issues related to the fusion position and ratio of multimodal data.
Experimental results demonstrate that AMF-ENAS achieves state-of-the-art
performance on the Ninapro DB2, DB3, and DB7 datasets.
☆ Exploring the Privacy Protection Capabilities of Chinese Large Language Models
Large language models (LLMs), renowned for their impressive capabilities in
various tasks, have significantly advanced artificial intelligence. Yet, these
advancements have raised growing concerns about privacy and security
implications. To address these issues and explain the risks inherent in these
models, we have devised a three-tiered progressive framework tailored for
evaluating privacy in language systems. This framework consists of
progressively complex and in-depth privacy test tasks at each tier. Our primary
objective is to comprehensively evaluate the sensitivity of large language
models to private information, examining how effectively they discern, manage,
and safeguard sensitive data in diverse scenarios. This systematic evaluation
helps us understand the degree to which these models comply with privacy
protection guidelines and the effectiveness of their inherent safeguards
against privacy breaches. Our observations indicate that existing Chinese large
language models universally show privacy protection shortcomings. It seems that
at the moment this widespread issue is unavoidable and may pose corresponding
privacy risks in applications based on these models.
comment: 11 pages
☆ EndToEndML: An Open-Source End-to-End Pipeline for Machine Learning Applications
Artificial intelligence (AI) techniques are widely applied in the life
sciences. However, applying innovative AI techniques to understand and
deconvolute biological complexity is hindered by the learning curve for life
science scientists to understand and use computing languages. An open-source,
user-friendly interface for AI models, that does not require programming skills
to analyze complex biological data will be extremely valuable to the
bioinformatics community. With easy access to different sequencing technologies
and increased interest in different 'omics' studies, the number of biological
datasets being generated has increased and analyzing these high-throughput
datasets is computationally demanding. The majority of AI libraries today
require advanced programming skills as well as machine learning, data
preprocessing, and visualization skills. In this research, we propose a
web-based end-to-end pipeline that is capable of preprocessing, training,
evaluating, and visualizing machine learning (ML) models without manual
intervention or coding expertise. By integrating traditional machine learning
and deep neural network models with visualizations, our library assists in
recognizing, classifying, clustering, and predicting a wide range of
multi-modal, multi-sensor datasets, including images, languages, and
one-dimensional numerical data, for drug discovery, pathogen classification,
and medical diagnostics.
comment: 2024 7th International Conference on Information and Computer
Technologies (ICICT)
☆ Looking Beyond What You See: An Empirical Analysis on Subgroup Intersectional Fairness for Multi-label Chest X-ray Classification Using Social Determinants of Racial Health Inequities ICCV
There has been significant progress in implementing deep learning models in
disease diagnosis using chest X- rays. Despite these advancements, inherent
biases in these models can lead to disparities in prediction accuracy across
protected groups. In this study, we propose a framework to achieve accurate
diagnostic outcomes and ensure fairness across intersectional groups in
high-dimensional chest X- ray multi-label classification. Transcending
traditional protected attributes, we consider complex interactions within
social determinants, enabling a more granular benchmark and evaluation of
fairness. We present a simple and robust method that involves retraining the
last classification layer of pre-trained models using a balanced dataset across
groups. Additionally, we account for fairness constraints and integrate
class-balanced fine-tuning for multi-label settings. The evaluation of our
method on the MIMIC-CXR dataset demonstrates that our framework achieves an
optimal tradeoff between accuracy and fairness compared to baseline methods.
comment: ICCV CVAMD 2023
☆ SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network
Autonomous assembly in robotics and 3D vision presents significant
challenges, particularly in ensuring assembly correctness. Presently,
predominant methods such as MEPNet focus on assembling components based on
manually provided images. However, these approaches often fall short in
achieving satisfactory results for tasks requiring long-term planning.
Concurrently, we observe that integrating a self-correction module can
partially alleviate such issues. Motivated by this concern, we introduce the
single-step assembly error correction task, which involves identifying and
rectifying misassembled components. To support research in this area, we
present the LEGO Error Correction Assembly Dataset (LEGO-ECA), comprising
manual images for assembly steps and instances of assembly failures.
Additionally, we propose the Self-Correct Assembly Network (SCANet), a novel
method to address this task. SCANet treats assembled components as queries,
determining their correctness in manual images and providing corrections when
necessary. Finally, we utilize SCANet to correct the assembly results of
MEPNet. Experimental results demonstrate that SCANet can identify and correct
MEPNet's misassembled results, significantly improving the correctness of
assembly. Our code and dataset are available at
https://github.com/Yaser-wyx/SCANet.
☆ Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence
A well-designed document communicates not only through its words but also
through its visual eloquence. Authors utilize aesthetic elements such as
colors, fonts, graphics, and layouts to shape the perception of information.
Thoughtful document design, informed by psychological insights, enhances both
the visual appeal and the comprehension of the content. While state-of-the-art
document AI models demonstrate the benefits of incorporating layout and image
data, it remains unclear whether the nuances of document aesthetics are
effectively captured. To bridge the gap between human cognition and AI
interpretation of aesthetic elements, we formulated hypotheses concerning AI
behavior in document understanding tasks, specifically anchored in document
design principles. With a focus on legibility and layout quality, we tested
four aspects of aesthetic effects: noise, font-size contrast, alignment, and
complexity, on model confidence using correlational analysis. The results and
observations highlight the value of model analysis rooted in document design
theories. Our work serves as a trailhead for further studies and we advocate
for continued research in this topic to deepen our understanding of how AI
interprets document aesthetics.
☆ Mechanisms of non-factual hallucinations in language models
State-of-the-art language models (LMs) sometimes generate non-factual
hallucinations that misalign with world knowledge. Despite extensive efforts to
detect and mitigate hallucinations, understanding their internal mechanisms
remains elusive. Our study investigates the mechanistic causes of
hallucination, specifically non-factual ones where the LM incorrectly predicts
object attributes in response to subject-relation queries. With causal
mediation analysis and embedding space projection, we identify two general
mechanistic causes of hallucinations shared across LMs of various scales and
designs: 1) insufficient subject attribute knowledge in lower layer MLPs, and
2) failing to select the correct object attribute in upper layer attention
heads and MLPs. These two mechanisms exhibit varying degrees of subject-object
association, predictive uncertainty and perturbation robustness. Additionally,
we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics
for the two mechanistic causes of hallucinations. We also highlight how
attribution features from our causal analysis can effectively construct
hallucination detectors. Our work proposes a mechanistic understanding of LM
factual errors.
♻ ☆ A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
Large communication costs are a critical bottleneck in training
state-of-the-art neural networks on distributed systems. This paper introduces
AxoNN, a novel four-dimensional (4D) parallelization approach, inspired by
Agarwal's algorithm for matrix multiplication, for parallelizing tensor
computations in deep learning, AxoNN employs two key strategies to minimize
communication overhead. First, we optimize communication by overlapping
expensive collective operations (reduce-scatter, all-gather, all-reduce) with
computations. Our experiments with a 20-billion parameter transformer model
demonstrate that these optimizations deliver nearly 53\% improvement. Second,
we present an analytical model to assist users in identifying
communication-minimizing configurations within the vast search space defined by
our 4D algorithm. This model empowers practitioners by simplifying the tuning
process for their specific training workloads. When training an 80-billion
parameter model on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a
state-of-the-art framework, by a significant 26%. Additionally, it achieves 57%
of the theoretical peak FLOP/s.
♻ ☆ Shifting to Machine Supervision: Annotation-Efficient Semi and Self-Supervised Learning for Automatic Medical Image Segmentation and Classification
Pranav Singh, Raviteja Chukkapalli, Shravan Chaudhari, Luoyao Chen, Mei Chen, Jinqian Pan, Craig Smuda, Jacopo Cirrone
Advancements in clinical treatment are increasingly constrained by the
limitations of supervised learning techniques, which depend heavily on large
volumes of annotated data. The annotation process is not only costly but also
demands substantial time from clinical specialists. Addressing this issue, we
introduce the S4MI (Self-Supervision and Semi-Supervision for Medical Imaging)
pipeline, a novel approach that leverages advancements in self-supervised and
semi-supervised learning. These techniques engage in auxiliary tasks that do
not require labeling, thus simplifying the scaling of machine supervision
compared to fully-supervised methods. Our study benchmarks these techniques on
three distinct medical imaging datasets to evaluate their effectiveness in
classification and segmentation tasks. Notably, we observed that self
supervised learning significantly surpassed the performance of supervised
methods in the classification of all evaluated datasets. Remarkably, the
semi-supervised approach demonstrated superior outcomes in segmentation,
outperforming fully-supervised methods while using 50% fewer labels across all
datasets. In line with our commitment to contributing to the scientific
community, we have made the S4MI code openly accessible, allowing for broader
application and further development of these methods.
comment: Seventeen pages (incl. references), five figures, and one table.
(Under Review)
♻ ☆ Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, Weiming Lu
Large Language Models exhibit robust problem-solving capabilities for diverse
tasks. However, most LLM-based agents are designed as specific task solvers
with sophisticated prompt engineering, rather than agents capable of learning
and evolving through interactions. These task solvers necessitate manually
crafted prompts to inform task rules and regulate LLM behaviors, inherently
incapacitating to address complex dynamic scenarios e.g., large interactive
games. In light of this, we propose Agent-Pro: an LLM-based Agent with
Policy-level Reflection and Optimization that can learn a wealth of expertise
from interactive experiences and progressively elevate its behavioral policy.
Specifically, it involves a dynamic belief generation and reflection process
for policy evolution. Rather than action-level reflection, Agent-Pro
iteratively reflects on past trajectories and beliefs, fine-tuning its
irrational beliefs for a better policy. Moreover, a depth-first search is
employed for policy optimization, ensuring continual enhancement in policy
payoffs. Agent-Pro is evaluated across two games: Blackjack and Texas Hold'em,
outperforming vanilla LLM and specialized models. Our results show Agent-Pro
can learn and evolve in complex and dynamic scenes, which also benefits
numerous LLM-based applications.
comment: LLM-based Agent
♻ ☆ Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives
The reflection capacity of Large Language Model (LLM) has garnered extensive
attention. A post-hoc prompting strategy, e.g., reflexion and self-refine,
refines LLM's response based on self-evaluated or external feedback. However,
recent research indicates without external feedback, LLM's intrinsic reflection
is unstable. Our investigation unveils that the key bottleneck is the quality
of the self-evaluated feedback. We find LLMs often exhibit overconfidence or
high randomness when self-evaluate, offering stubborn or inconsistent feedback,
which causes poor reflection. To remedy this, we advocate Self-Contrast: It
adaptively explores diverse solving perspectives tailored to the request,
contrasts the differences, and summarizes these discrepancies into a checklist
which could be used to re-examine and eliminate discrepancies. Our method
endows LLM with diverse perspectives to alleviate stubborn biases. Moreover,
their discrepancies indicate potential errors or inherent uncertainties that
LLM often overlooks. Reflecting upon these can catalyze more accurate and
stable reflection. Experiments conducted on a series of reasoning and
translation tasks with different LLMs serve to underscore the effectiveness and
generality of our strategy.
♻ ☆ Generalization Bounds: Perspectives from Information Theory and PAC-Bayes
A fundamental question in theoretical machine learning is generalization.
Over the past decades, the PAC-Bayesian approach has been established as a
flexible framework to address the generalization capabilities of machine
learning algorithms, and design new ones. Recently, it has garnered increased
interest due to its potential applicability for a variety of learning
algorithms, including deep neural networks. In parallel, an
information-theoretic view of generalization has developed, wherein the
relation between generalization and various information measures has been
established. This framework is intimately connected to the PAC-Bayesian
approach, and a number of results have been independently discovered in both
strands. In this monograph, we highlight this strong connection and present a
unified treatment of PAC-Bayesian and information-theoretic generalization
bounds. We present techniques and results that the two perspectives have in
common, and discuss the approaches and interpretations that differ. In
particular, we demonstrate how many proofs in the area share a modular
structure, through which the underlying ideas can be intuited. We pay special
attention to the conditional mutual information (CMI) framework; analytical
studies of the information complexity of learning algorithms; and the
application of the proposed methods to deep learning. This monograph is
intended to provide a comprehensive introduction to information-theoretic
generalization bounds and their connection to PAC-Bayes, serving as a
foundation from which the most recent developments are accessible. It is aimed
broadly towards researchers with an interest in generalization and theoretical
machine learning.
comment: 228 pages
♻ ☆ Decoupled Data Consistency with Diffusion Purification for Image Restoration
Diffusion models have recently gained traction as a powerful class of deep
generative priors, excelling in a wide range of image restoration tasks due to
their exceptional ability to model data distributions. To solve image
restoration problems, many existing techniques achieve data consistency by
incorporating additional likelihood gradient steps into the reverse sampling
process of diffusion models. However, the additional gradient steps pose a
challenge for real-world practical applications as they incur a large
computational overhead, thereby increasing inference time. They also present
additional difficulties when using accelerated diffusion model samplers, as the
number of data consistency steps is limited by the number of reverse sampling
steps. In this work, we propose a novel diffusion-based image restoration
solver that addresses these issues by decoupling the reverse process from the
data consistency steps. Our method involves alternating between a
reconstruction phase to maintain data consistency and a refinement phase that
enforces the prior via diffusion purification. Our approach demonstrates
versatility, making it highly adaptable for efficient problem-solving in latent
space. Additionally, it reduces the necessity for numerous sampling steps
through the integration of consistency models. The efficacy of our approach is
validated through comprehensive experiments across various image restoration
tasks, including image denoising, deblurring, inpainting, and super-resolution.
♻ ☆ FedSN: A Novel Federated Learning Framework over LEO Satellite Networks
Recently, a large number of Low Earth Orbit (LEO) satellites have been
launched and deployed successfully in space by commercial companies, such as
SpaceX. Due to multimodal sensors equipped by the LEO satellites, they serve
not only for communication but also for various machine learning applications,
such as space modulation recognition, remote sensing image classification, etc.
However, the ground station (GS) may be incapable of downloading such a large
volume of raw sensing data for centralized model training due to the limited
contact time with LEO satellites (e.g. 5 minutes). Therefore, federated
learning (FL) has emerged as the promising solution to address this problem via
on-device training. Unfortunately, to enable FL on LEO satellites, we still
face three critical challenges that are i) heterogeneous computing and memory
capabilities, ii) limited uplink rate, and iii) model staleness. To this end,
we propose FedSN as a general FL framework to tackle the above challenges, and
fully explore data diversity on LEO satellites. Specifically, we first present
a novel sub-structure scheme to enable heterogeneous local model training
considering different computing, memory, and communication constraints on LEO
satellites. Additionally, we propose a pseudo-synchronous model aggregation
strategy to dynamically schedule model aggregation for compensating model
staleness. To further demonstrate the effectiveness of the FedSN, we evaluate
it using space modulation recognition and remote sensing image classification
tasks by leveraging the data from real-world satellite networks. Extensive
experimental results demonstrate that FedSN framework achieves higher accuracy,
lower computing, and communication overhead than the state-of-the-art
benchmarks and the effectiveness of each components in FedSN.
comment: 14 pages, 17 figures
♻ ☆ Nonlinear Control Allocation: A Learning Based Approach
Modern aircraft are designed with redundant control effectors to cater for
fault tolerance and maneuverability requirements. This leads to aircraft being
over-actuated and requires control allocation schemes to distribute the control
commands among control effectors. Traditionally, optimization-based control
allocation schemes are used; however, for nonlinear allocation problems, these
methods require large computational resources. In this work, an artificial
neural network (ANN) based nonlinear control allocation scheme is proposed. The
proposed scheme is composed of learning the inverse of the control
effectiveness map through ANN, and then implementing it as an allocator instead
of solving an online optimization problem. Stability conditions are presented
for closed-loop systems incorporating the allocator, and computational
challenges are explored with piece-wise linear effectiveness functions and
ANN-based allocators. To demonstrate the efficacy of the proposed scheme, it is
compared with a standard quadratic programming-based method for control
allocation.
comment: submitted to IEEE Conference on Decision and Control (CDC), 2024
♻ ☆ NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
While recent large-scale text-to-speech (TTS) models have achieved
significant progress, they still fall short in speech quality, similarity, and
prosody. Considering speech intricately encompasses various attributes (e.g.,
content, prosody, timbre, and acoustic details) that pose significant
challenges for generation, a natural idea is to factorize speech into
individual subspaces representing different attributes and generate them
individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with
novel factorized diffusion models to generate natural speech in a zero-shot
way. Specifically, 1) we design a neural codec with factorized vector
quantization (FVQ) to disentangle speech waveform into subspaces of content,
prosody, timbre, and acoustic details; 2) we propose a factorized diffusion
model to generate attributes in each subspace following its corresponding
prompt. With this factorization design, NaturalSpeech 3 can effectively and
efficiently model intricate speech with disentangled subspaces in a
divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the
state-of-the-art TTS systems on quality, similarity, prosody, and
intelligibility, and achieves on-par quality with human recordings.
Furthermore, we achieve better performance by scaling to 1B parameters and 200K
hours of training data.
comment: Achieving human-level quality and naturalness on multi-speaker
datasets (e.g., LibriSpeech) in a zero-shot way
♻ ☆ ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital divide, and Ethics) Evaluation: A Review
ChatGPT is another large language model (LLM) vastly available for the
consumers on their devices but due to its performance and ability to converse
effectively, it has gained a huge popularity amongst research as well as
industrial community. Recently, many studies have been published to show the
effectiveness, efficiency, integration, and sentiments of chatGPT and other
LLMs. In contrast, this study focuses on the important aspects that are mostly
overlooked, i.e. sustainability, privacy, digital divide, and ethics and
suggests that not only chatGPT but every subsequent entry in the category of
conversational bots should undergo Sustainability, PrivAcy, Digital divide, and
Ethics (SPADE) evaluation. This paper discusses in detail the issues and
concerns raised over chatGPT in line with aforementioned characteristics. We
also discuss the recent EU AI Act briefly in accordance with the SPADE
evaluation. We support our hypothesis by some preliminary data collection and
visualizations along with hypothesized facts. We also suggest mitigations and
recommendations for each of the concerns. Furthermore, we also suggest some
policies and recommendations for EU AI policy act concerning ethics, digital
divide, and sustainability.
comment: 29 pages, 8 figures, 4 tables
♻ ☆ Incorporating simulated spatial context information improves the effectiveness of contrastive learning models
Visual learning often occurs in a specific context, where an agent acquires
skills through exploration and tracking of its location in a consistent
environment. The historical spatial context of the agent provides a similarity
signal for self-supervised contrastive learning. We present a unique approach,
termed Environmental Spatial Similarity (ESS), that complements existing
contrastive learning methods. Using images from simulated, photorealistic
environments as an experimental setting, we demonstrate that ESS outperforms
traditional instance discrimination approaches. Moreover, sampling additional
data from the same environment substantially improves accuracy and provides new
augmentations. ESS allows remarkable proficiency in room classification and
spatial prediction tasks, especially in unfamiliar environments. This learning
paradigm has the potential to enable rapid visual learning in agents operating
in new environments with unique visual characteristics. Potentially
transformative applications span from robotics to space exploration. Our proof
of concept demonstrates improved efficiency over methods that rely on
extensive, disconnected datasets.
♻ ☆ Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling AISTATS
Arman Adibi, Nicolo Dal Fabbro, Luca Schenato, Sanjeev Kulkarni, H. Vincent Poor, George J. Pappas, Hamed Hassani, Aritra Mitra
Motivated by applications in large-scale and multi-agent reinforcement
learning, we study the non-asymptotic performance of stochastic approximation
(SA) schemes with delayed updates under Markovian sampling. While the effect of
delays has been extensively studied for optimization, the manner in which they
interact with the underlying Markov process to shape the finite-time
performance of SA remains poorly understood. In this context, our first main
contribution is to show that under time-varying bounded delays, the delayed SA
update rule guarantees exponentially fast convergence of the \emph{last
iterate} to a ball around the SA operator's fixed point. Notably, our bound is
\emph{tight} in its dependence on both the maximum delay $\tau_{max}$, and the
mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel
inductive proof technique that, unlike various existing delayed-optimization
analyses, relies on establishing uniform boundedness of the iterates. As such,
our proof may be of independent interest. Next, to mitigate the impact of the
maximum delay on the convergence rate, we provide the first finite-time
analysis of a delay-adaptive SA scheme under Markovian sampling. In particular,
we show that the exponent of convergence of this scheme gets scaled down by
$\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here,
$\tau_{avg}$ denotes the average delay across all iterations. Moreover, the
adaptive scheme requires no prior knowledge of the delay sequence for step-size
tuning. Our theoretical findings shed light on the finite-time effects of
delays for a broad class of algorithms, including TD learning, Q-learning, and
stochastic gradient descent under Markovian sampling.
comment: Accepted to the 27th International Conference on Artificial
Intelligence and Statistics (AISTATS) 2024!
♻ ☆ Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning AAAI2024
Semi-supervised learning (SSL) methods assume that labeled data, unlabeled
data and test data are from the same distribution. Open-set semi-supervised
learning (Open-set SSL) considers a more practical scenario, where unlabeled
data and test data contain new categories (outliers) not observed in labeled
data (inliers). Most previous works focused on outlier detection via binary
classifiers, which suffer from insufficient scalability and inability to
distinguish different types of uncertainty. In this paper, we propose a novel
framework, Adaptive Negative Evidential Deep Learning (ANEDL) to tackle these
limitations. Concretely, we first introduce evidential deep learning (EDL) as
an outlier detector to quantify different types of uncertainty, and design
different uncertainty metrics for self-training and inference. Furthermore, we
propose a novel adaptive negative optimization strategy, making EDL more
tailored to the unlabeled dataset containing both inliers and outliers. As
demonstrated empirically, our proposed method outperforms existing
state-of-the-art methods across four datasets.
comment: Accepted by AAAI2024
♻ ☆ SeSaMe: A Framework to Simulate Self-Reported Ground Truth for Mental Health Sensing Studies
Advances in mobile and wearable technologies have enabled the potential to
passively monitor a person's mental, behavioral, and affective health. These
approaches typically rely on longitudinal collection of self-reported outcomes,
e.g., depression, stress, and anxiety, to train machine learning (ML) models.
However, the need to continuously self-report adds a significant burden on the
participants, often resulting in attrition, missing labels, or insincere
responses. In this work, we introduce the Scale Scores Simulation using Mental
Models (SeSaMe) framework to alleviate participants' burden in digital mental
health studies. By leveraging pre-trained large language models (LLMs), SeSaMe
enables the simulation of participants' responses on psychological scales. In
SeSaMe, researchers can prompt LLMs with information on participants' internal
behavioral dispositions, enabling LLMs to construct mental models of
participants to simulate their responses on psychological scales. We
demonstrate an application of SeSaMe, where we use GPT-4 to simulate responses
on one scale using responses from another as behavioral information. We also
evaluate the alignment between human and SeSaMe-simulated responses to
psychological scales. Then, we present experiments to inspect the utility of
SeSaMe-simulated responses as ground truth in training ML models by replicating
established depression and anxiety screening tasks from a previous study. Our
results indicate SeSaMe to be a promising approach, but its alignment may vary
across scales and specific prediction objectives. We also observed that model
performance with simulated data was on par with using the real data for
training in most evaluation scenarios. We conclude by discussing the potential
implications of SeSaMe in addressing some challenges researchers face with
ground-truth collection in passive sensing studies.
♻ ☆ Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity
This paper deals with federated learning (FL) in the presence of malicious
Byzantine attacks and data heterogeneity. A novel Robust Average Gradient
Algorithm (RAGA) is proposed, which leverages the geometric median for
aggregation and can freely select the round number for local updating.
Different from most existing resilient approaches, which perform convergence
analysis based on strongly-convex loss function or homogeneously distributed
dataset, we conduct convergence analysis for not only strongly-convex but also
non-convex loss function over heterogeneous dataset. According to our
theoretical analysis, as long as the fraction of dataset from malicious users
is less than half, RAGA can achieve convergence at rate
$\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and
$\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for
strongly-convex loss function. Moreover, stationary point or global optimal
solution is proved to obtainable as data heterogeneity vanishes. Experimental
results corroborate the robustness of RAGA to Byzantine attacks and verifies
the advantage of RAGA over baselines on convergence performance under various
intensity of Byzantine attacks, for heterogeneous dataset.
♻ ☆ Demystifying Misconceptions in Social Bots Research
Stefano Cresci, Kai-Cheng Yang, Angelo Spognardi, Roberto Di Pietro, Filippo Menczer, Marinella Petrocchi
Research on social bots aims at advancing knowledge and providing solutions
to one of the most debated forms of online manipulation. Yet, social bot
research is plagued by widespread biases, hyped results, and misconceptions
that set the stage for ambiguities, unrealistic expectations, and seemingly
irreconcilable findings. Overcoming such issues is instrumental towards
ensuring reliable solutions and reaffirming the validity of the scientific
method. In this contribution, we review some recent results in social bots
research, highlighting and revising factual errors as well as methodological
and conceptual biases. More importantly, we demystify common misconceptions,
addressing fundamental points on how social bots research is discussed. Our
analysis surfaces the need to discuss research about online disinformation and
manipulation in a rigorous, unbiased, and responsible way. This article
bolsters such effort by identifying and refuting common fallacious arguments
used by both proponents and opponents of social bots research, as well as
providing directions toward sound methodologies for future research in the
field.
♻ ☆ DeepMachining: Online Prediction of Machining Errors of Lathe Machines
We describe DeepMachining, a deep learning-based AI system for online
prediction of machining errors of lathe machine operations. We have built and
evaluated DeepMachining based on manufacturing data from factories.
Specifically, we first pretrain a deep learning model for a given lathe
machine's operations to learn the salient features of machining states. Then,
we fine-tune the pretrained model to adapt to specific machining tasks. We
demonstrate that DeepMachining achieves high prediction accuracy for multiple
tasks that involve different workpieces and cutting tools. To the best of our
knowledge, this work is one of the first factory experiments using pre-trained
deep-learning models to predict machining errors of lathe machines.
♻ ☆ Structure Guided Large Language Model for SQL Generation
Generating accurate Structured Querying Language (SQL) is a long-standing
problem, especially in matching users' semantic queries with structured
databases and then generating structured SQL. Existing models typically input
queries and database schemas into the LLM and rely on the LLM to perform
semantic-structure matching and generate structured SQL. However, such
solutions overlook the structural information within user queries and
databases, which can be utilized to enhance the generation of structured SQL.
This oversight can lead to inaccurate or unexecutable SQL generation. To fully
exploit the structure, we propose a structure-to-SQL framework, which leverages
the inherent structure information to improve the SQL generation of LLMs.
Specifically, we introduce our Structure Guided SQL~(SGU-SQL) generation model.
SGU-SQL first links user queries and databases in a structure-enhanced manner.
It then decomposes complicated linked structures with grammar trees to guide
the LLM to generate the SQL step by step. Extensive experiments on two
benchmark datasets illustrate that SGU-SQL can outperform sixteen SQL
generation baselines.
♻ ☆ Recurrent Action Transformer with Memory
Recently, the use of transformers in offline reinforcement learning has
become a rapidly developing area. This is due to their ability to treat the
agent's trajectory in the environment as a sequence, thereby reducing the
policy learning problem to sequence modeling. In environments where the agent's
decisions depend on past events, it is essential to capture both the event
itself and the decision point in the context of the model. However, the
quadratic complexity of the attention mechanism limits the potential for
context expansion. One solution to this problem is to enhance transformers with
memory mechanisms. In this paper, we propose the Recurrent Action Transformer
with Memory (RATE) - a model that incorporates recurrent memory. To evaluate
our model, we conducted extensive experiments on both memory-intensive
environments (VizDoom-Two-Color, T-Maze) and classic Atari games and MuJoCo
control environments. The results show that the use of memory can significantly
improve performance in memory-intensive environments while maintaining or
improving results in classic environments. We hope that our findings will
stimulate research on memory mechanisms for transformers applicable to offline
reinforcement learning.
comment: 15 pages, 11 figures
♻ ☆ Generative Pre-Training of Time-Series Data for Unsupervised Fault Detection in Semiconductor Manufacturing
This paper introduces TRACE-GPT, which stands for Time-seRies
Anomaly-detection with Convolutional Embedding and Generative Pre-trained
Transformers. TRACE-GPT is designed to pre-train univariate time-series sensor
data and detect faults on unlabeled datasets in semiconductor manufacturing. In
semiconductor industry, classifying abnormal time-series sensor data from
normal data is important because it is directly related to wafer defect.
However, small, unlabeled, and even mixed training data without enough
anomalies make classification tasks difficult. In this research, we capture
features of time-series data with temporal convolutional embedding and
Generative Pre-trained Transformer (GPT) to classify abnormal sequences from
normal sequences using cross entropy loss. We prove that our model shows better
performance than previous unsupervised models with both an open dataset, the
University of California Riverside (UCR) time-series classification archive,
and the process log of our Chemical Vapor Deposition (CVD) equipment. Our model
has the highest F1 score at Equal Error Rate (EER) across all datasets and is
only 0.026 below the supervised state-of-the-art baseline on the open dataset.
♻ ☆ Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey NAACL 2024
Large Language Models (LLMs) are now commonplace in conversation
applications. However, their risks of misuse for generating harmful responses
have raised serious societal concerns and spurred recent research on LLM
conversation safety. Therefore, in this survey, we provide a comprehensive
overview of recent studies, covering three critical aspects of LLM conversation
safety: attacks, defenses, and evaluations. Our goal is to provide a structured
summary that enhances understanding of LLM conversation safety and encourages
further investigation into this important subject. For easy reference, we have
categorized all the studies mentioned in this survey according to our taxonomy,
available at: https://github.com/niconi19/LLM-conversation-safety.
comment: Accepted to NAACL 2024
♻ ☆ Shapley Values-Powered Framework for Fair Reward Split in Content Produced by GenAI
It is evident that, currently, generative models are surpassed in quality by
human professionals. However, with the advancements in Artificial Intelligence,
this gap will narrow, leading to scenarios where individuals who have dedicated
years of their lives to mastering a skill become obsolete due to their high
costs, which are inherently linked to the time they require to complete a task
-- a task that AI could accomplish in minutes or seconds. To avoid future
social upheavals, we must, even now, contemplate how to fairly assess the
contributions of such individuals in training generative models and how to
compensate them for the reduction or complete loss of their incomes. In this
work, we propose a method to structure collaboration between model developers
and data providers. To achieve this, we employ Shapley Values to quantify the
contribution of artist(s) in an image generated by the Stable Diffusion-v1.5
model and to equitably allocate the reward among them.
comment: 36 pages, 32 figures
♻ ☆ ABScribe: Rapid Exploration & Organization of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models
Mohi Reza, Nathan Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan "Michael" Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, Joseph Jay Williams
Exploring alternative ideas by rewriting text is integral to the writing
process. State-of-the-art Large Language Models (LLMs) can simplify writing
variation generation. However, current interfaces pose challenges for
simultaneous consideration of multiple variations: creating new variations
without overwriting text can be difficult, and pasting them sequentially can
clutter documents, increasing workload and disrupting writers' flow. To tackle
this, we present ABScribe, an interface that supports rapid, yet visually
structured, exploration and organization of writing variations in human-AI
co-writing tasks. With ABScribe, users can swiftly modify variations using LLM
prompts, which are auto-converted into reusable buttons. Variations are stored
adjacently within text fields for rapid in-place comparisons using mouse-over
interactions on a popup toolbar. Our user study with 12 writers shows that
ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances
user perceptions of the revision process (d = 2.41, p < 0.001) compared to a
popular baseline workflow, and provides insights into how writers explore
variations using LLMs.
comment: CHI 2024
♻ ☆ CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning CVPR 2024
Partial-label learning (PLL) is an important weakly supervised learning
problem, which allows each training example to have a candidate label set
instead of a single ground-truth label. Identification-based methods have been
widely explored to tackle label ambiguity issues in PLL, which regard the true
label as a latent variable to be identified. However, identifying the true
labels accurately and completely remains challenging, causing noise in pseudo
labels during model training. In this paper, we propose a new method called
CroSel, which leverages historical predictions from the model to identify true
labels for most training examples. First, we introduce a cross selection
strategy, which enables two deep models to select true labels of partially
labeled data for each other. Besides, we propose a novel consistency
regularization term called co-mix to avoid sample waste and tiny noise caused
by false selection. In this way, CroSel can pick out the true labels of most
examples with high precision. Extensive experiments demonstrate the superiority
of CroSel, which consistently outperforms previous state-of-the-art methods on
benchmark datasets. Additionally, our method achieves over 90\% accuracy and
quantity for selecting true labels on CIFAR-type datasets under various
settings.
comment: Accepted by CVPR 2024
♻ ☆ Challenging Common Paradigms in Multi-Task Learning
While multi-task learning (MTL) has gained significant attention in recent
years, its underlying mechanisms remain poorly understood. Recent methods did
not yield consistent performance improvements over single task learning (STL)
baselines, underscoring the importance of gaining more profound insights about
challenges specific to MTL. In our study, we challenge paradigms in MTL in the
context of STL: First, the impact of the choice of optimizer has only been
mildly investigated in MTL. We show the pivotal role of common STL tools such
as the Adam optimizer in MTL empirically in various experiments. To further
investigate Adam's effectiveness, we theoretical derive a partial loss-scale
invariance under mild assumptions. Second, the notion of gradient conflicts has
often been phrased as a specific problem in MTL. We delve into the role of
gradient conflicts in MTL and compare it to STL. For angular gradient alignment
we find no evidence that this is a unique problem in MTL. We emphasize
differences in gradient magnitude as the main distinguishing factor. Lastly, we
compare the transferability of features learned through MTL and STL on common
image corruptions, and find light evidence that MTL can lead to superior
transferability. Overall, we find surprising similarities between STL and MTL
suggesting to consider methods from both fields in a broader context.
comment: -
♻ ☆ Solving a Real-World Package Delivery Routing Problem Using Quantum Annealers
Research focused on the conjunction between quantum computing and routing
problems has been very prolific in recent years. Most of the works revolve
around classical problems such as the Traveling Salesman Problem or the Vehicle
Routing Problem. Even though working on these problems is valuable, it is also
undeniable that their academic-oriented nature falls short of real-world
requirements. The main objective of this research is to present a solving
method for realistic instances, avoiding problem relaxations or technical
shortcuts. Instead, a quantum-classical hybrid solver has been developed,
coined Q4RPD, that considers a set of real constraints such as a heterogeneous
fleet of vehicles, priority deliveries, and capacities characterized by two
values: weight and dimensions of the packages. Q4RPD resorts to the Leap
Constrained Quadratic Model Hybrid Solver of D-Wave. To demonstrate the
application of Q4RPD, an experimentation composed of six different instances
has been conducted, aiming to serve as illustrative examples.
comment: 15 pages, 11 figures and 4 tables. Paper submitted for review in
Scientific Reports
♻ ☆ Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation CVPR 2024
Transformers have been successfully applied in the field of video-based 3D
human pose estimation. However, the high computational costs of these video
pose transformers (VPTs) make them impractical on resource-constrained devices.
In this paper, we present a plug-and-play pruning-and-recovering framework,
called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose
estimation from videos. Our HoT begins with pruning pose tokens of redundant
frames and ends with recovering full-length tokens, resulting in a few pose
tokens in the intermediate transformer blocks and thus improving the model
efficiency. To effectively achieve this, we propose a token pruning cluster
(TPC) that dynamically selects a few representative tokens with high semantic
diversity while eliminating the redundancy of video frames. In addition, we
develop a token recovering attention (TRA) to restore the detailed
spatio-temporal information based on the selected tokens, thereby expanding the
network output to the original full-length temporal resolution for fast
inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and
MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and
estimation accuracy compared to the original VPT models. For instance, applying
to MotionBERT and MixSTE on Human3.6M, our HoT can save nearly 50% FLOPs
without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop,
respectively. Code and models are available at
https://github.com/NationalGAILab/HoT.
comment: Accepted by CVPR 2024, Open Sourced
♻ ☆ $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models NAACL2024
Prompt-based learning is a new language model training paradigm that adapts
the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes
the performance benchmarks across various natural language processing (NLP)
tasks. Instead of using a fixed prompt template to fine-tune the model, some
research demonstrates the effectiveness of searching for the prompt via
optimization. Such prompt optimization process of prompt-based learning on PLMs
also gives insight into generating adversarial prompts to mislead the model,
raising concerns about the adversarial vulnerability of this paradigm. Recent
studies have shown that universal adversarial triggers (UATs) can be generated
to alter not only the predictions of the target PLMs but also the prediction of
corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based
learning paradigm. However, UATs found in previous works are often unreadable
tokens or characters and can be easily distinguished from natural texts with
adaptive defenses. In this work, we consider the naturalness of the UATs and
develop $\textit{LinkPrompt}$, an adversarial attack algorithm to generate UATs
by a gradient-based beam search algorithm that not only effectively attacks the
target PLMs and PFMs but also maintains the naturalness among the trigger
tokens. Extensive results demonstrate the effectiveness of
$\textit{LinkPrompt}$, as well as the transferability of UATs generated by
$\textit{LinkPrompt}$ to open-sourced Large Language Model (LLM) Llama2 and
API-accessed LLM GPT-3.5-turbo.
comment: Accepted to the main conference of NAACL2024
♻ ☆ LLatrieval: LLM-Verified Retrieval for Verifiable Generation NAACL 2024
Verifiable generation aims to let the large language model (LLM) generate
text with supporting documents, which enables the user to flexibly verify the
answer and makes the LLM's output more reliable. Retrieval plays a crucial role
in verifiable generation. Specifically, the retrieved documents not only
supplement knowledge to help the LLM generate correct answers, but also serve
as supporting evidence for the user to verify the LLM's output. However, the
widely used retrievers become the bottleneck of the entire pipeline and limit
the overall performance. Their capabilities are usually inferior to LLMs since
they often have much fewer parameters than the large language model and have
not been demonstrated to scale well to the size of LLMs. If the retriever does
not correctly find the supporting documents, the LLM can not generate the
correct and verifiable answer, which overshadows the LLM's remarkable
abilities. To address these limitations, we propose \LLatrieval (Large Language
Model Verified Retrieval), where the LLM updates the retrieval result until it
verifies that the retrieved documents can sufficiently support answering the
question. Thus, the LLM can iteratively provide feedback to retrieval and
facilitate the retrieval result to fully support verifiable generation.
Experiments show that LLatrieval significantly outperforms extensive baselines
and achieves state-of-the-art results.
comment: Accepted by NAACL 2024 (Main Conference)
♻ ☆ Enhancing Object Coherence in Layout-to-Image Synthesis
Layout-to-image synthesis is an emerging technique in conditional image
generation. It aims to generate complex scenes, where users require fine
control over the layout of the objects in a scene. However, it remains
challenging to control the object coherence, including semantic coherence
(e.g., the cat looks at the flowers or not) and physical coherence (e.g., the
hand and the racket should not be misaligned). In this paper, we propose a
novel diffusion model with effective global semantic fusion (GSF) and
self-similarity feature enhancement modules to guide the object coherence for
this task. For semantic coherence, we argue that the image caption contains
rich information for defining the semantic relationship within the objects in
the images. Instead of simply employing cross-attention between captions and
generated images, which addresses the highly relevant layout restriction and
semantic coherence separately and thus leads to unsatisfying results shown in
our experiments, we develop GSF to fuse the supervision from the layout
restriction and semantic coherence requirement and exploit it to guide the
image synthesis process. Moreover, to improve the physical coherence, we
develop a Self-similarity Coherence Attention (SCA) module to explicitly
integrate local contextual physical coherence into each pixel's generation
process. Specifically, we adopt a self-similarity map to encode the coherence
restrictions and employ it to extract coherent features from text embedding.
Through visualization of our self-similarity map, we explore the essence of
SCA, revealing that its effectiveness is not only in capturing reliable
physical coherence patterns but also in enhancing complex texture generation.
Extensive experiments demonstrate the superiority of our proposed method in
both image generation quality and controllability.
♻ ☆ OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
To help the open-source community have a better understanding of
Mixture-of-Experts (MoE) based large language models (LLMs), we train and
release OpenMoE, a series of fully open-sourced and reproducible decoder-only
MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T
tokens. Our investigation confirms that MoE-based LLMs can offer a more
favorable cost-effectiveness trade-off than dense LLMs, highlighting the
potential effectiveness for future LLM development.
One more important contribution of this study is an in-depth analysis of the
routing mechanisms within our OpenMoE models, leading to three significant
findings: Context-Independent Specialization, Early Routing Learning, and
Drop-towards-the-End. We discovered that routing decisions in MoE models are
predominantly based on token IDs, with minimal context relevance. The
token-to-expert assignments are determined early in the pre-training phase and
remain largely unchanged. This imperfect routing can result in performance
degradation, particularly in sequential tasks like multi-turn conversations,
where tokens appearing later in a sequence are more likely to be dropped.
Finally, we rethink our design based on the above-mentioned observations and
analysis. To facilitate future MoE LLM development, we propose potential
strategies for mitigating the issues we found and further improving
off-the-shelf MoE LLM designs.
♻ ☆ VIGraph: Generative Self-supervised Learning for Class-Imbalanced Node Classification
Class imbalance in graph data presents significant challenges for node
classification. While existing methods, such as SMOTE-based approaches,
partially mitigate this issue, they still exhibit limitations in constructing
imbalanced graphs. Generative self-supervised learning (SSL) methods,
exemplified by graph autoencoders (GAEs), offer a promising solution by
directly generating minority nodes from the data itself, yet their potential
remains underexplored. In this paper, we delve into the shortcomings of
SMOTE-based approaches in the construction of imbalanced graphs. Furthermore,
we introduce VIGraph, a simple yet effective generative SSL approach that
relies on the Variational GAE as the fundamental model. VIGraph strictly
adheres to the concept of imbalance when constructing imbalanced graphs and
innovatively leverages the variational inference (VI) ability of Variational
GAE to generate nodes for minority classes. VIGraph introduces comprehensive
training strategies, including cross-view contrastive learning at the decoding
phase to capture semantic knowledge, adjacency matrix reconstruction to
preserve graph structure, and alignment strategy to ensure stable training.
VIGraph can generate high-quality nodes directly usable for classification,
eliminating the need to integrate the generated nodes back to the graph as well
as additional retraining found in SMOTE-based methods. We conduct extensive
experiments, results from which demonstrate the superiority and generality of
our approach.
♻ ☆ Learning Quadruped Locomotion Using Differentiable Simulation
While most recent advancements in legged robot control have been driven by
model-free reinforcement learning, we explore the potential of differentiable
simulation. Differentiable simulation promises faster convergence and more
stable training by computing low-variant first-order gradients using the robot
model, but so far, its use for legged robot control has remained limited to
simulation. The main challenge with differentiable simulation lies in the
complex optimization landscape of robotic tasks due to discontinuities in
contact-rich environments, e.g., quadruped locomotion. This work proposes a
new, differentiable simulation framework to overcome these challenges. The key
idea involves decoupling the complex whole-body simulation, which may exhibit
discontinuities due to contact, into two separate continuous domains.
Subsequently, we align the robot state resulting from the simplified model with
a more precise, non-differentiable simulator to maintain sufficient simulation
accuracy. Our framework enables learning quadruped walking in minutes using a
single simulated robot without any parallelization. When augmented with GPU
parallelization, our approach allows the quadruped robot to master diverse
locomotion skills, including trot, pace, bound, and gallop, on challenging
terrains in minutes. Additionally, our policy achieves robust locomotion
performance in the real world zero-shot. To the best of our knowledge, this
work represents the first demonstration of using differentiable simulation for
controlling a real quadruped robot. This work provides several important
insights into using differentiable simulations for legged locomotion in the
real world.
♻ ☆ Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang
Large Language Models (LLMs) showcase impressive capabilities but encounter
challenges like hallucination, outdated knowledge, and non-transparent,
untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has
emerged as a promising solution by incorporating knowledge from external
databases. This enhances the accuracy and credibility of the generation,
particularly for knowledge-intensive tasks, and allows for continuous knowledge
updates and integration of domain-specific information. RAG synergistically
merges LLMs' intrinsic knowledge with the vast, dynamic repositories of
external databases. This comprehensive review paper offers a detailed
examination of the progression of RAG paradigms, encompassing the Naive RAG,
the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the
tripartite foundation of RAG frameworks, which includes the retrieval, the
generation and the augmentation techniques. The paper highlights the
state-of-the-art technologies embedded in each of these critical components,
providing a profound understanding of the advancements in RAG systems.
Furthermore, this paper introduces up-to-date evaluation framework and
benchmark. At the end, this article delineates the challenges currently faced
and points out prospective avenues for research and development.
comment: Ongoing Work
♻ ☆ World Models via Policy-Guided Trajectory Diffusion
World models are a powerful tool for developing intelligent agents. By
predicting the outcome of a sequence of actions, world models enable policies
to be optimised via on-policy reinforcement learning (RL) using synthetic data,
i.e. in "in imagination". Existing world models are autoregressive in that they
interleave predicting the next state with sampling the next action from the
policy. Prediction error inevitably compounds as the trajectory length grows.
In this work, we propose a novel world modelling approach that is not
autoregressive and generates entire on-policy trajectories in a single pass
through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion
(PolyGRAD), leverages a denoising model in addition to the gradient of the
action distribution of the policy to diffuse a trajectory of initially random
states and actions into an on-policy synthetic trajectory. We analyse the
connections between PolyGRAD, score-based generative models, and
classifier-guided diffusion models. Our results demonstrate that PolyGRAD
outperforms state-of-the-art baselines in terms of trajectory prediction error
for short trajectories, with the exception of autoregressive diffusion. For
short trajectories, PolyGRAD obtains similar errors to autoregressive
diffusion, but with lower computational requirements. For long trajectories,
PolyGRAD obtains comparable performance to baselines. Our experiments
demonstrate that PolyGRAD enables performant policies to be trained via
on-policy RL in imagination for MuJoCo continuous control domains. Thus,
PolyGRAD introduces a new paradigm for accurate on-policy world modelling
without autoregressive sampling.
comment: Published in TMLR, March 2024
♻ ☆ Functional Graph Convolutional Networks: A unified multi-task and multi-modal learning framework to facilitate health and social-care insights
Tobia Boschi, Francesca Bonin, Rodrigo Ordonez-Hurtado, Cécile Rousseau, Alessandra Pascale, John Dinsmore
This paper introduces a novel Functional Graph Convolutional Network (funGCN)
framework that combines Functional Data Analysis and Graph Convolutional
Networks to address the complexities of multi-task and multi-modal learning in
digital health and longitudinal studies. With the growing importance of health
solutions to improve health care and social support, ensure healthy lives, and
promote well-being at all ages, funGCN offers a unified approach to handle
multivariate longitudinal data for multiple entities and ensures
interpretability even with small sample sizes. Key innovations include
task-specific embedding components that manage different data types, the
ability to perform classification, regression, and forecasting, and the
creation of a knowledge graph for insightful data interpretation. The efficacy
of funGCN is validated through simulation experiments and a real-data
application.
♻ ☆ Learning Concept-Based Causal Transition and Symbolic Reasoning for Visual Planning
Visual planning simulates how humans make decisions to achieve desired goals
in the form of searching for visual causal transitions between an initial
visual state and a final visual goal state. It has become increasingly
important in egocentric vision with its advantages in guiding agents to perform
daily tasks in complex environments. In this paper, we propose an interpretable
and generalizable visual planning framework consisting of i) a novel
Substitution-based Concept Learner (SCL) that abstracts visual inputs into
disentangled concept representations, ii) symbol abstraction and reasoning that
performs task planning via the self-learned symbols, and iii) a Visual Causal
Transition model (ViCT) that grounds visual causal transitions to semantically
similar real-world actions. Given an initial state, we perform goal-conditioned
visual planning with a symbolic reasoning method fueled by the learned
representations and causal transitions to reach the goal state. To verify the
effectiveness of the proposed model, we collect a large-scale visual planning
dataset based on AI2-THOR, dubbed as CCTP. Extensive experiments on this
challenging dataset demonstrate the superior performance of our method in
visual task planning. Empirically, we show that our framework can generalize to
unseen task trajectories, unseen object categories, and real-world data.
Further details of this work are provided at
https://fqyqc.github.io/ConTranPlan/.
♻ ☆ Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space EACL 2023
Prior research has investigated the impact of various linguistic features on
cross-lingual transfer performance. In this study, we investigate the manner in
which this effect can be mapped onto the representation space. While past
studies have focused on the impact on cross-lingual alignment in multilingual
language models during fine-tuning, this study examines the absolute evolution
of the respective language representation spaces produced by MLLMs. We place a
specific emphasis on the role of linguistic characteristics and investigate
their inter-correlation with the impact on representation spaces and
cross-lingual transfer performance. Additionally, this paper provides
preliminary evidence of how these findings can be leveraged to enhance transfer
to linguistically distant languages.
comment: SIGTYP Workshop 2023 (co-located with EACL 2023)
♻ ☆ The Effects of Mixed Sample Data Augmentation are Class Dependent
Mixed Sample Data Augmentation (MSDA) techniques, such as Mixup, CutMix, and
PuzzleMix, have been widely acknowledged for enhancing performance in a variety
of tasks. A previous study reported the class dependency of traditional data
augmentation (DA), where certain classes benefit disproportionately compared to
others. This paper reveals a class dependent effect of MSDA, where some classes
experience improved performance while others experience degraded performance.
This research addresses the issue of class dependency in MSDA and proposes an
algorithm to mitigate it. The approach involves training on a mixture of MSDA
and non-MSDA data, which not only mitigates the negative impact on the affected
classes, but also improves overall accuracy. Furthermore, we provide in-depth
analysis and discussion of why MSDA introduced class dependencies and which
classes are most likely to have them.
comment: 21 pages, 18 figures, Overall Revision
♻ ☆ Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation CVPR2024
Although 3D shape matching and interpolation are highly interrelated, they
are often studied separately and applied sequentially to relate different 3D
shapes, thus resulting in sub-optimal performance. In this work we present a
unified framework to predict both point-wise correspondences and shape
interpolation between 3D shapes. To this end, we combine the deep functional
map framework with classical surface deformation models to map shapes in both
spectral and spatial domains. On the one hand, by incorporating spatial maps,
our method obtains more accurate and smooth point-wise correspondences compared
to previous functional map methods for shape matching. On the other hand, by
introducing spectral maps, our method gets rid of commonly used but
computationally expensive geodesic distance constraints that are only valid for
near-isometric shape deformations. Furthermore, we propose a novel test-time
adaptation scheme to capture both pose-dominant and shape-dominant
deformations. Using different challenging datasets, we demonstrate that our
method outperforms previous state-of-the-art methods for both shape matching
and interpolation, even compared to supervised approaches.
comment: accepted by CVPR2024
♻ ☆ Towards Regulatable AI Systems: Technical Gaps and Policy Opportunities
Xudong Shen, Hannah Brown, Jiashu Tao, Martin Strobel, Yao Tong, Akshay Narayan, Harold Soh, Finale Doshi-Velez
There is increasing attention being given to how to regulate AI systems. As
governing bodies grapple with what values to encapsulate into regulation, we
consider the technical half of the question: To what extent can AI experts vet
an AI system for adherence to regulatory requirements? We investigate this
question through the lens of two public sector procurement checklists,
identifying what we can do now, what should be possible with technical
innovation, and what requirements need a more interdisciplinary approach.
comment: scheduled for publication in the Communications of the ACM, titled
"Directions of Technical Innovation for Regulatable AI Systems"
♻ ☆ MMP++: Motion Manifold Primitives with Parametric Curve Models
Motion Manifold Primitives (MMP), a manifold-based approach for encoding
basic motion skills, can produce diverse trajectories, enabling the system to
adapt to unseen constraints. Nonetheless, we argue that current MMP models lack
crucial functionalities of movement primitives, such as temporal and via-points
modulation, found in traditional approaches. This shortfall primarily stems
from MMP's reliance on discrete-time trajectories. To overcome these
limitations, we introduce Motion Manifold Primitives++ (MMP++), a new model
that integrates the strengths of both MMP and traditional methods by
incorporating parametric curve representations into the MMP framework.
Furthermore, we identify a significant challenge with MMP++: performance
degradation due to geometric distortions in the latent space, meaning that
similar motions are not closely positioned. To address this, Isometric Motion
Manifold Primitives++ (IMMP++) is proposed to ensure the latent space
accurately preserves the manifold's geometry. Our experimental results across
various applications, including 2-DoF planar motions, 7-DoF robot arm motions,
and SE(3) trajectory planning, show that MMP++ and IMMP++ outperform existing
methods in trajectory generation tasks, achieving substantial improvements in
some cases. Moreover, they enable the modulation of latent coordinates and
via-points, thereby allowing efficient online adaptation to dynamic
environments.
comment: 12 pages. This work has been submitted to the IEEE for possible
publication
♻ ☆ Regret-Based Defense in Adversarial Reinforcement Learning AAMAS 2024
Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable
to small adversarial noise in observations. Such adversarial noise can have
disastrous consequences in safety-critical environments. For instance, a
self-driving car receiving adversarially perturbed sensory observations about
nearby signs (e.g., a stop sign physically altered to be perceived as a speed
limit sign) or objects (e.g., cars altered to be recognized as trees) can be
fatal. Existing approaches for making RL algorithms robust to an
observation-perturbing adversary have focused on reactive approaches that
iteratively improve against adversarial examples generated at each iteration.
While such approaches have been shown to provide improvements over regular RL
methods, they are reactive and can fare significantly worse if certain
categories of adversarial examples are not generated during training. To that
end, we pursue a more proactive approach that relies on directly optimizing a
well-studied robustness measure, regret instead of expected value. We provide a
principled approach that minimizes maximum regret over a "neighborhood" of
observations to the received "observation". Our regret criterion can be used to
modify existing value- and policy-based Deep RL methods. We demonstrate that
our approaches provide a significant improvement in performance across a wide
variety of benchmarks against leading approaches for robust Deep RL.
comment: Accepted at AAMAS 2024
♻ ☆ MA4DIV: Multi-Agent Reinforcement Learning for Search Result Diversification
Yiqun Chen, Jiaxin Mao, Yi Zhang, Dehong Ma, Long Xia, Jun Fan, Daiting Shi, Zhicong Cheng, Simiu Gu, Dawei Yin
The objective of search result diversification (SRD) is to ensure that
selected documents cover as many different subtopics as possible. Existing
methods primarily utilize a paradigm of "greedy selection", i.e., selecting one
document with the highest diversity score at a time. These approaches tend to
be inefficient and are easily trapped in a suboptimal state. In addition, some
other methods aim to approximately optimize the diversity metric, such as
$\alpha$-NDCG, but the results still remain suboptimal. To address these
challenges, we introduce Multi-Agent reinforcement learning (MARL) for search
result DIVersity, which called MA4DIV. In this approach, each document is an
agent and the search result diversification is modeled as a cooperative task
among multiple agents. This approach allows for directly optimizing the
diversity metrics, such as $\alpha$-NDCG, while achieving high training
efficiency. We conducted preliminary experiments on public TREC datasets to
demonstrate the effectiveness and potential of MA4DIV. Considering the limited
number of queries in public TREC datasets, we construct a large-scale dataset
from industry sources and show that MA4DIV achieves substantial improvements in
both effectiveness and efficiency than existing baselines on a industrial scale
dataset.
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform
diverse tasks in underrepresented languages using only short in-context
information, offering a crucial avenue for narrowing the gap between
high-resource and low-resource languages. Nonetheless, there is only a handful
of works explored ICL for low-resource languages with most of them focusing on
relatively high-resource languages, such as French and Spanish. In this work,
we extensively study ICL and its cross-lingual variation (X-ICL) on 25
low-resource and 7 relatively higher-resource languages. Our study not only
assesses the effectiveness of ICL with LLMs in low-resource languages but also
identifies the shortcomings of in-context label alignment, and introduces a
more effective alternative: query alignment. Moreover, we provide valuable
insights into various facets of ICL for low-resource languages. Our study
concludes the significance of few-shot in-context information on enhancing the
low-resource understanding quality of LLMs through semantically relevant
information by closing the language gap in the target language and aligning the
semantics between the targeted low-resource and the high-resource language that
the model is proficient in. Our work highlights the importance of advancing ICL
research, particularly for low-resource languages.
♻ ☆ Enhancing Programming Education with ChatGPT: A Case Study on Student Perceptions and Interactions in a Python Course
The integration of ChatGPT as a supportive tool in education, notably in
programming courses, addresses the unique challenges of programming education
by providing assistance with debugging, code generation, and explanations.
Despite existing research validating ChatGPT's effectiveness, its application
in university-level programming education and a detailed understanding of
student interactions and perspectives remain limited. This paper explores
ChatGPT's impact on learning in a Python programming course tailored for
first-year students over eight weeks. By analyzing responses from surveys,
open-ended questions, and student-ChatGPT dialog data, we aim to provide a
comprehensive view of ChatGPT's utility and identify both its advantages and
limitations as perceived by students. Our study uncovers a generally positive
reception toward ChatGPT and offers insights into its role in enhancing the
programming education experience. These findings contribute to the broader
discourse on AI's potential in education, suggesting paths for future research
and application.
♻ ☆ SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces ICLR 2024
Given the remarkable achievements in image generation through diffusion
models, the research community has shown increasing interest in extending these
models to video generation. Recent diffusion models for video generation have
predominantly utilized attention layers to extract temporal features. However,
attention layers are limited by their memory consumption, which increases
quadratically with the length of the sequence. This limitation presents
significant challenges when attempting to generate longer video sequences using
diffusion models. To overcome this challenge, we propose leveraging state-space
models (SSMs). SSMs have recently gained attention as viable alternatives due
to their linear memory consumption relative to sequence length. In the
experiments, we first evaluate our SSM-based model with UCF101, a standard
benchmark of video generation. In addition, to investigate the potential of
SSMs for longer video generation, we perform an experiment using the MineRL
Navigate dataset, varying the number of frames to 64, 200, and 400. In these
settings, our SSM-based model can considerably save memory consumption for
longer sequences, while maintaining competitive FVD scores to the
attention-based models. Our codes are available at
https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
comment: Accepted as workshop paper at ICLR 2024
♻ ☆ Weakly Supervised AUC Optimization: A Unified Partial AUC Approach
Since acquiring perfect supervision is usually difficult, real-world machine
learning tasks often confront inaccurate, incomplete, or inexact supervision,
collectively referred to as weak supervision. In this work, we present WSAUC, a
unified framework for weakly supervised AUC optimization problems, which covers
noisy label learning, positive-unlabeled learning, multi-instance learning, and
semi-supervised learning scenarios. Within the WSAUC framework, we first frame
the AUC optimization problems in various weakly supervised scenarios as a
common formulation of minimizing the AUC risk on contaminated sets, and
demonstrate that the empirical risk minimization problems are consistent with
the true AUC. Then, we introduce a new type of partial AUC, specifically, the
reversed partial AUC (rpAUC), which serves as a robust training objective for
AUC maximization in the presence of contaminated labels. WSAUC offers a
universal solution for AUC optimization in various weakly supervised scenarios
by maximizing the empirical rpAUC. Theoretical and experimental results under
multiple settings support the effectiveness of WSAUC on a range of weakly
supervised AUC optimization tasks.
comment: Accepted by IEEE TPAMI
♻ ☆ ProSwitch: Knowledge-Guided Language Model Fine-Tuning to Generate Professional and Non-Professional Styled Text
Large Language Models (LLMs) have demonstrated efficacy in various linguistic
applications, including text summarization and controlled text generation.
However, studies into their capacity of switching between styles via
fine-tuning remain underexplored. This study concentrates on textual
professionalism and introduces a novel methodology, named ProSwitch, which
equips a language model with the ability to produce both professional and
non-professional responses through knowledge-guided instruction tuning.
ProSwitch unfolds across three phases: data preparation for gathering domain
knowledge and training corpus; instruction tuning for optimizing language
models with multiple levels of instruction formats; and comprehensive
evaluation for assessing the professionalism discrimination and reference-based
quality of generated text. Comparative analysis of ProSwitch against both
general and specialized language models reveals that our approach outperforms
baselines in switching between professional and non-professional text
generation.
comment: 8 pages
♻ ☆ Re2LLM: Reflective Reinforcement Large Language Model for Session-based Recommendation
Large Language Models (LLMs) are emerging as promising approaches to enhance
session-based recommendation (SBR), where both prompt-based and
fine-tuning-based methods have been widely investigated to align LLMs with SBR.
However, the former methods struggle with optimal prompts to elicit the correct
reasoning of LLMs due to the lack of task-specific feedback, leading to
unsatisfactory recommendations. Although the latter methods attempt to
fine-tune LLMs with domain-specific knowledge, they face limitations such as
high computational costs and reliance on open-source backbones. To address such
issues, we propose a Reflective Reinforcement Large Language Model (Re2LLM) for
SBR, guiding LLMs to focus on specialized knowledge essential for more accurate
recommendations effectively and efficiently. In particular, we first design the
Reflective Exploration Module to effectively extract knowledge that is readily
understandable and digestible by LLMs. To be specific, we direct LLMs to
examine recommendation errors through self-reflection and construct a knowledge
base (KB) comprising hints capable of rectifying these errors. To efficiently
elicit the correct reasoning of LLMs, we further devise the Reinforcement
Utilization Module to train a lightweight retrieval agent. It learns to select
hints from the constructed KB based on the task-specific feedback, where the
hints can serve as guidance to help correct LLMs reasoning for better
recommendations. Extensive experiments on multiple real-world datasets
demonstrate that our method consistently outperforms state-of-the-art methods.
comment: 11 pages, 4 figures
♻ ☆ Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems NAACL 2024
Dialogue response selection aims to select an appropriate response from
several candidates based on a given user and system utterance history. Most
existing works primarily focus on post-training and fine-tuning tailored for
cross-encoders. However, there are no post-training methods tailored for dense
encoders in dialogue response selection. We argue that when the current
language model, based on dense dialogue systems (such as BERT), is employed as
a dense encoder, it separately encodes dialogue context and response, leading
to a struggle to achieve the alignment of both representations. Thus, we
propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward
yet effective post-training technique tailored for dense encoders in dialogue
response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to
compress the dialogue semantics into dense vectors, which achieves better
alignment between the features of the dialogue context and response. Our
experiments have demonstrated that Dial-MAE is highly effective, achieving
state-of-the-art performance on two commonly evaluated benchmarks.
comment: This paper has been accepted by NAACL 2024
♻ ☆ SoftTiger: A Clinical Foundation Model for Healthcare Workflows
We introduce SoftTiger, a clinical large language model (CLaM) designed as a
foundation model for healthcare workflows. The narrative and unstructured
nature of clinical notes is a major obstacle for healthcare intelligentization.
We address a critical problem of structuring clinical notes into clinical data,
according to international interoperability standards. We collect and annotate
data for three subtasks, namely, international patient summary, clinical
impression and medical encounter. We then supervised fine-tuned a
state-of-the-art LLM using public and credentialed clinical data. The training
is orchestrated in a way that the target model can first support basic clinical
tasks such as abbreviation expansion and temporal information extraction, and
then learn to perform more complex downstream clinical tasks. Moreover, we
address several modeling challenges in the healthcare context, e.g., extra long
context window. Our blind pairwise evaluation shows that SoftTiger outperforms
other popular open-source models and GPT-3.5, comparable to Gemini-pro, with a
mild gap from GPT-4. We believe that LLMs may become a step-stone towards
healthcare digitalization and democratization. Therefore, we publicly release
SoftTiger models at scales of 13 billion and 70 billion parameters, as well as
datasets and code for our innovative scalable evaluation, hopefully, making a
significant contribution to the healthcare industry.
♻ ☆ Probing Multimodal Large Language Models for Global and Local Semantic Representations LREC
The advancement of Multimodal Large Language Models (MLLMs) has greatly
accelerated the development of applications in understanding integrated texts
and images. Recent works leverage image-caption datasets to train MLLMs,
achieving state-of-the-art performance on image-to-text tasks. However, there
are few studies exploring which layers of MLLMs make the most effort to the
global image information, which plays vital roles in multimodal comprehension
and generation. In this study, we find that the intermediate layers of models
can encode more global semantic information, whose representation vectors
perform better on visual-language entailment tasks, rather than the topmost
layers. We further probe models regarding local semantic representations
through object recognition tasks. We find that the topmost layers may
excessively focus on local information, leading to a diminished ability to
encode global information. Our code and data are released via
https://github.com/kobayashikanna01/probing_MLLM_rep.
comment: Accepted by LREC-COLING 2024 as a short paper (Camera Ready)
♻ ☆ In-Distribution and Out-of-Distribution Self-supervised ECG Representation Learning for Arrhythmia Detection
This paper presents a systematic investigation into the effectiveness of
Self-Supervised Learning (SSL) methods for Electrocardiogram (ECG) arrhythmia
detection. We begin by conducting a novel analysis of the data distributions on
three popular ECG-based arrhythmia datasets: PTB-XL, Chapman, and Ribeiro. To
the best of our knowledge, our study is the first to quantitatively explore and
characterize these distributions in the area. We then perform a comprehensive
set of experiments using different augmentations and parameters to evaluate the
effectiveness of various SSL methods, namely SimCRL, BYOL, and SwAV, for ECG
representation learning, where we observe the best performance achieved by
SwAV. Furthermore, our analysis shows that SSL methods achieve highly
competitive results to those achieved by supervised state-of-the-art methods.
To further assess the performance of these methods on both In-Distribution (ID)
and Out-of-Distribution (OOD) ECG data, we conduct cross-dataset training and
testing experiments. Our comprehensive experiments show almost identical
results when comparing ID and OOD schemes, indicating that SSL techniques can
learn highly effective representations that generalize well across different
OOD datasets. This finding can have major implications for ECG-based arrhythmia
detection. Lastly, to further analyze our results, we perform detailed
per-disease studies on the performance of the SSL methods on the three
datasets.
comment: This paper has been published in the IEEE Journal of Biomedical and
Health Informatics (JBHI). Copyright IEEE. Please cite as: S. Soltanieh, J.
Hashemi and A. Etemad, "In-Distribution and Out-of-Distribution
Self-Supervised ECG Representation Learning for Arrhythmia Detection," in
IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 2, pp.
789-800, Feb. 2024
♻ ☆ Imitating Cost-Constrained Behaviors in Reinforcement Learning ICAPS-24
Complex planning and scheduling problems have long been solved using various
optimization or heuristic approaches. In recent years, imitation learning that
aims to learn from expert demonstrations has been proposed as a viable
alternative to solving these problems. Generally speaking, imitation learning
is designed to learn either the reward (or preference) model or directly the
behavioral policy by observing the behavior of an expert. Existing work in
imitation learning and inverse reinforcement learning has focused on imitation
primarily in unconstrained settings (e.g., no limit on fuel consumed by the
vehicle). However, in many real-world domains, the behavior of an expert is
governed not only by reward (or preference) but also by constraints. For
instance, decisions on self-driving delivery vehicles are dependent not only on
the route preferences/rewards (depending on past demand data) but also on the
fuel in the vehicle and the time available. In such problems, imitation
learning is challenging as decisions are not only dictated by the reward model
but are also dependent on a cost-constrained model. In this paper, we provide
multiple methods that match expert distributions in the presence of trajectory
cost constraints through (a) Lagrangian-based method; (b) Meta-gradients to
find a good trade-off between expected return and minimizing constraint
violation; and (c) Cost-violation-based alternating gradient. We empirically
show that leading imitation learning approaches imitate cost-constrained
behaviors poorly and our meta-gradient-based approach achieves the best
performance.
comment: Accepted to the 34th International Conference on Automated Planning
and Scheduling (ICAPS-24)
♻ ☆ The Innovation Paradox: Concept Space Expansion with Diminishing Originality and the Promise of Creative AI
Innovation, typically spurred by reusing, recombining, and synthesizing
existing concepts, is expected to result in an exponential growth of the
concept space over time. However, our statistical analysis of TechNet, which is
a comprehensive technology semantic network encompassing over four million
concepts derived from patent texts, reveals a linear rather than exponential
expansion of the overall technological concept space. Moreover, there is a
notable decline in the originality of newly created concepts. These trends can
be attributed to the constraints of human cognitive abilities to innovate
beyond an ever-growing space of prior art, among other factors. Integrating
creative artificial intelligence (CAI) into the innovation process holds the
potential to overcome these limitations and alter the observed trends in the
future.
comment: Forthcoming on the Design Science
♻ ☆ LLMs in Political Science: Heralding a New Era of Visual Analysis
Interest is increasing among political scientists in leveraging the extensive
information available in images. However, the challenge of interpreting these
images lies in the need for specialized knowledge in computer vision and access
to specialized hardware. As a result, image analysis has been limited to a
relatively small group within the political science community. This landscape
could potentially change thanks to the rise of large language models (LLMs).
This paper aims to raise awareness of the feasibility of using Gemini for image
content analysis. A retrospective analysis was conducted on a corpus of 688
images. Content reports were elicited from Gemini for each image and then
manually evaluated by the authors. We find that Gemini is highly accurate in
performing object detection, which is arguably the most common and fundamental
task in image analysis for political scientists. Equally important, we show
that it is easy to implement as the entire command consists of a single prompt
in natural language; it is fast to run and should meet the time budget of most
researchers; and it is free to use and does not require any specialized
hardware. In addition, we illustrate how political scientists can leverage
Gemini for other image understanding tasks, including face identification,
sentiment analysis, and caption generation. Our findings suggest that Gemini
and other similar LLMs have the potential to drastically stimulate and
accelerate image research in political science and social sciences more
broadly.
comment: 7 pages, 3 tables
♻ ☆ Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models LREC
Fine-tuning in information retrieval systems using pre-trained language
models (PLM-based IR) requires learning query representations and
query-document relations, in addition to downstream task-specific learning.
This study introduces coarse-tuning as an intermediate learning stage that
bridges pre-training and fine-tuning. By learning query representations and
query-document relations in coarse-tuning, we aim to reduce the load of
fine-tuning and improve the learning effect of downstream IR tasks. We propose
Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the
appropriateness of query-document pairs. Evaluation experiments show that the
proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc
document retrieval datasets. Furthermore, the results of the query prediction
task suggested that coarse-tuning facilitated learning of query representation
and query-document relations.
comment: Accepted at LREC-COLING 2024
♻ ☆ Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models
Large language models (LLMs) still grapple with complex tasks like
mathematical reasoning. Despite significant efforts invested in improving
prefix prompts or reasoning process, the crucial role of problem context might
have been neglected. Accurate recognition of inputs is fundamental for solving
mathematical tasks, as ill-formed problems could potentially mislead LLM's
reasoning. In this study, we propose a new approach named Problem Elaboration
Prompting (PEP) to enhance the mathematical capacities of LLMs. Specifically,
PEP decomposes and elucidates the problem context before reasoning, therefore
enhancing the context modeling and parsing efficiency. Experiments across
datasets and models demonstrate promising performances: (1) PEP demonstrates an
overall enhancement in various mathematical tasks. For instance, with the
GPT-3.5 model, PEP exhibits improvements of 9.93% and 8.80% on GSM8k through
greedy decoding and self-consistency, respectively. (2) PEP can be easily
implemented and integrated with other prompting methods. (3) PEP shows
particular strength in handling distraction problems.
♻ ☆ Follower Agnostic Methods for Stackelberg Games
In this paper, we present an efficient algorithm to solve online Stackelberg
games, featuring multiple followers, in a follower-agnostic manner. Unlike
previous works, our approach works even when leader has no knowledge about the
followers' utility functions or strategy space. Our algorithm introduces a
unique gradient estimator, leveraging specially designed strategies to probe
followers. In a departure from traditional assumptions of optimal play, we
model followers' responses using a convergent adaptation rule, allowing for
realistic and dynamic interactions. The leader constructs the gradient
estimator solely based on observations of followers' actions. We provide both
non-asymptotic convergence rates to stationary points of the leader's objective
and demonstrate asymptotic convergence to a \emph{local Stackelberg
equilibrium}. To validate the effectiveness of our algorithm, we use this
algorithm to solve the problem of incentive design on a large-scale
transportation network, showcasing its robustness even when the leader lacks
access to followers' demand.
comment: 31 pages
♻ ☆ Learning to Act without Actions ICLR 2024
Pre-training large models on vast amounts of web data has proven to be an
effective approach for obtaining powerful, general models in domains such as
language and vision. However, this paradigm has not yet taken hold in
reinforcement learning. This is because videos, the most abundant form of
embodied behavioral data on the web, lack the action labels required by
existing methods for imitating behavior from demonstrations. We introduce
Latent Action Policies (LAPO), a method for recovering latent action
information, and thereby latent-action policies, world models, and inverse
dynamics models, purely from videos. LAPO is the first method able to recover
the structure of the true action space just from observed dynamics, even in
challenging procedurally-generated environments. LAPO enables training
latent-action policies that can be rapidly fine-tuned into expert-level
policies, either offline using a small action-labeled dataset, or online with
rewards. LAPO takes a first step towards pre-training powerful, generalist
policies and world models on the vast amounts of videos readily available on
the web.
comment: Accepted at ICLR 2024 (spotlight). The code can be found at
http://github.com/schmidtdominik/LAPO